Why can't my SQL UDF be registered?

2014-12-15 Thread Xuelin Cao
Hi, I tried to create a function to convert a Unix timestamp to the hour of the day. It works if the code is like this: sqlContext.registerFunction("toHour", (x:Long)=>{new java.util.Date(x*1000).getHours}) But if I do it like this, it doesn't work: def toHour
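
A minimal sketch of the two registration styles, assuming a Spark 1.x SQLContext; the explicit "toHour _" eta-expansion is my guess at the missing piece, not a confirmed fix from this thread:

    import java.util.Date

    // Registering an anonymous function value works directly
    sqlContext.registerFunction("toHour", (x: Long) => new Date(x * 1000).getHours)

    // A named method usually has to be turned into a function value explicitly
    def toHour(x: Long): Int = new Date(x * 1000).getHours
    sqlContext.registerFunction("toHourDef", toHour _)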

When will Spark SQL support building DB indexes natively?

2014-12-17 Thread Xuelin Cao
Hi, In the Spark SQL documentation, it says "Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL. - Block level bitmap indexes and virtual columns (used to build indexes)" For our

Re: When will Spark SQL support building DB indexes natively?

2014-12-17 Thread Xuelin Cao
ioned table support? That would only scan data where the predicate matches the partition. Depending on the cardinality of the customerId column, that could be a good option for you. On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao wrote: Hi, In the Spark SQL documentation, it says "Some of thes
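
For reference, a hedged sketch of the partitioned-table idea suggested above, assuming a HiveContext and hypothetical table and column names; a predicate on the partition column lets Spark skip the partitions that don't match:

    // hypothetical table, partitioned by the column used in the predicate
    hiveContext.sql("""
      CREATE TABLE ad_clicks (url STRING, ts BIGINT)
      PARTITIONED BY (customerId INT)""")

    // only the customerId = 42 partition's data should be read
    hiveContext.sql("SELECT count(*) FROM ad_clicks WHERE customerId = 42")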

Why doesn't Parquet predicate pushdown work?

2015-01-06 Thread Xuelin Cao
Hi, I'm testing the Parquet file format, and predicate pushdown is a very useful feature for us. However, it looks like predicate pushdown doesn't work after I set sqlContext.sql("SET spark.sql.parquet.filterPushdown=true") Here is my SQL: sqlContext.sql("
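
A minimal sketch of the setup being tested, assuming Spark 1.2; the Parquet path, table name and column are hypothetical (see the reply below for why the flag is off by default):

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // hypothetical Parquet data set
    val events = sqlContext.parquetFile("/data/events.parquet")
    events.registerTempTable("events")

    // with pushdown enabled, the user_id > 1000 filter can be evaluated inside the Parquet reader
    sqlContext.sql("SELECT count(*) FROM events WHERE user_id > 1000").collect()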

Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Xuelin Cao
Hi, Curious and curious. I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be a columnar table, and only the columns included in my SQL should be scanned. However, in my test, I always see the whole table being scanned even though I only "select" one column i
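
For context, a sketch of caching a table in the in-memory columnar format, assuming Spark 1.2 and a hypothetical table name; with cacheTable the cached data is stored column by column, so a single-column query should only need that column's batches:

    // hypothetical registered table
    sqlContext.cacheTable("adTable")

    // materialize the cache, then query a single column
    sqlContext.sql("SELECT count(*) FROM adTable").collect()
    sqlContext.sql("SELECT ad_user_id FROM adTable").collect()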

Re: Why doesn't Parquet predicate pushdown work?

2015-01-07 Thread Xuelin Cao
null pointers >> when there are full row groups that are null. >> >> https://issues.apache.org/jira/browse/SPARK-4258 >> >> You can turn it on if you want: >> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration >> >> Daniel >> >>

Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
Hi, Currently, we are building up a mid-scale Spark cluster (100 nodes) in our company. One thing bothering us is how Spark manages resources (CPU, memory). I know there are 3 resource management modes: standalone, Mesos, and YARN. In standalone mode, the cluster maste

Re: Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
on, I think it's important to see > what other applications you want to be running besides Spark in the same > cluster and also your use cases, to see what resource management fits your > need. > > Tim > > > On Wed, Jan 7, 2015 at 10:55 PM, Xuelin Cao > wrote: > >

Re: Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
not. > > Tim > > On Wed, Jan 7, 2015 at 11:19 PM, Xuelin Cao > wrote: > >> >> Hi, >> >> Thanks for the information. >> >> One more thing I want to clarify, when does Mesos or Yarn allocate >> and release the resource? Aka,

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
the input data for each task is also 1212.5MB On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian wrote: > Hey Xuelin, which data item in the Web UI did you check? > > > On 1/7/15 5:37 PM, Xuelin Cao wrote: > > > Hi, > >Curious and curious. I'm puzzl

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
) > > The input data of the first statement is 292KB, the second is 49.1KB. > > The JSON file I used is examples/src/main/resources/people.json, I copied > its contents multiple times to generate a larger file. > > Cheng > > On 1/8/15 7:43 PM, Xuelin Cao wrote: > > &

Did anyone try overcommitting CPU cores?

2015-01-08 Thread Xuelin Cao
Hi, I'm wondering whether it is a good idea to overcommit CPU cores on the Spark cluster. For example, in our testing cluster, each worker machine has 24 physical CPU cores. However, we are allowed to set the CPU core number to 48 or more in the Spark configuration file. As a result,
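
For reference, a sketch of the standalone-mode knob being described (the memory value is just a placeholder); as far as I know SPARK_WORKER_CORES is simply the number of cores the worker advertises to the master, so nothing stops it from exceeding the physical count:

    # conf/spark-env.sh on a 24-core worker (sketch)
    export SPARK_WORKER_CORES=48     # advertise twice the physical cores
    export SPARK_WORKER_MEMORY=64g   # placeholder value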

Re: Did anyone try overcommitting CPU cores?

2015-01-09 Thread Xuelin Cao
ng etc.). > > Why not increase the tasks per core? > > Best regards > On 9 Jan 2015 at 06:46, "Xuelin Cao" wrote: > > >> Hi, >> >> I'm wondering whether it is a good idea to overcommit CPU cores on >> the spark cluster. >> >>

How to create an empty RDD with a given type?

2015-01-12 Thread Xuelin Cao
Hi, I'd like to create a transform function that converts RDD[String] to RDD[Int]. Occasionally, the input RDD could be an empty RDD. I just want to directly create an empty RDD[Int] if the input RDD is empty. And I don't want to return None as the result. Is there an easy way to do
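
A minimal sketch, assuming Spark 1.2, of two ways to build an empty RDD of a specific element type:

    import org.apache.spark.rdd.RDD

    val empty: RDD[Int] = sc.emptyRDD[Int]
    // or, equivalently:
    val empty2: RDD[Int] = sc.parallelize(Seq.empty[Int])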

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Xuelin Cao
ustin > > On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao > wrote: > >> >> >> Hi, >> >> I'd like to create a transform function, that convert RDD[String] to >> RDD[Int] >> >> Occasionally, the input RDD could be an empty RDD. I ju

IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm trying to migrate some Hive scripts to Spark SQL. However, I found some statements are incompatible with Spark SQL. Here is my SQL, and the same SQL works fine in the Hive environment. SELECT *if(ad_user_id>1000, 1000, ad_user_id) as user_id* FROM ad_search_keywor

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm using Spark 1.2 On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan wrote: > Hi Xuelin, > > > > What version of Spark are you using? > > > > Thanks, > > Daoyuan > > > > *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com] > *Sent:* Tues

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
APEETHAM | Amritapuri | Cell +919946535290 | > > > On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S. wrote: > >> Which context are you using HiveContext or SQLContext ? Can you try with >> HiveContext >> ?? >> >> >> Devan M.S. | Research Associate | Cyber
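
A sketch of the suggestion above, assuming a Spark 1.2 build with Hive support; the table name is a placeholder based on the truncated one in the original post. The HiveContext parser understands Hive functions such as if(), which the plain SQLContext parser of that era may not:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      "SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_search_keywords")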

In HA master mode, how do we identify the alive master?

2015-03-04 Thread Xuelin Cao
Hi, In our project, we use "standalone dual masters" + "ZooKeeper" for HA of the Spark master. Now the problem is, how do we know which master is currently the alive one? We tried to read the info that the master stores in ZooKeeper, but we found there is no information to
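
One hedged way to check, assuming both masters expose the standalone web UI on port 8080 and that its /json endpoint reports a status field (ALIVE vs STANDBY); the hostnames here are hypothetical:

    import scala.io.Source

    val masters = Seq("master1:8080", "master2:8080")   // hypothetical hosts
    val alive = masters.find { m =>
      try {
        // strip whitespace so the check works regardless of JSON pretty-printing
        val body = Source.fromURL(s"http://$m/json").mkString.replaceAll("\\s", "")
        body.contains("\"status\":\"ALIVE\"")
      } catch { case _: Exception => false }
    }
    println(alive.getOrElse("no ALIVE master found"))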

Why does the Spark master consume 100% CPU when we kill a Spark Streaming app?

2015-03-10 Thread Xuelin Cao
Hey, recently we found in our cluster that when we kill a Spark Streaming app, the whole cluster cannot respond for 10 minutes. We investigated the master node and found that the master process consumes 100% CPU when we kill the Spark Streaming app. How could this happen? Did any

Is there a way to turn on the Spark eventLog on the worker node?

2014-11-21 Thread Xuelin Cao
Hi, I'm going to debug some Spark applications on our testing platform, and it would be helpful if we could see the eventLog on the *worker* node. I've tried to turn on *spark.eventLog.enabled* and set the *spark.eventLog.dir* parameter on the worker node. However, it doesn't work. I do

Is there a way to turn on the Spark eventLog on the worker node?

2014-11-24 Thread Xuelin Cao
Hi, I'm going to debug some Spark applications on our testing platform, and it would be helpful if we could see the eventLog on the worker node. I've tried to turn on spark.eventLog.enabled and set the spark.eventLog.dir parameter on the worker node. However, it doesn't work. I do ha
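
For what it's worth, the event log is written by the application (driver) side rather than by the worker daemon, so a sketch of setting it where the application is launched, with a placeholder app name and log directory:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("debug-app")                                // placeholder name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark/event-logs")  // placeholder directory
    val sc = new SparkContext(conf)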

Is it possible to just change the value of items in an RDD without making a full copy?

2014-12-02 Thread Xuelin Cao
Hi, I'd like to make an operation on an RDD that ONLY changes the value of some items, without making a full copy or a full scan of the data. It is useful when I need to handle a large RDD where each time I only need to change a small fraction of the data and keep the other data unchanged.
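
Since RDDs are immutable, a sketch of the usual route is a map that rewrites only the matching items (the element types and the toChange set are hypothetical); no full copy is materialized until an action runs, although every element still passes through the function:

    // hypothetical: an RDD[(Long, Int)] where only a few keys need a new value
    val toChange = Set(1L, 42L)
    val updated = rdd.map { case (k, v) => if (toChange.contains(k)) (k, v + 1) else (k, v) }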

Spark SQL: How to get a hierarchical element with SQL?

2014-12-07 Thread Xuelin Cao
Hi, I'm generating a Spark SQL table from an offline JSON file. The difficulty is that the original JSON file has a hierarchical structure, and as a result, this is what I get: scala> tb.printSchema root |-- budget: double (nullable = true) |-- filterIp: array (nullable = true) |
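
A sketch of querying the nested fields, assuming the schema above and that "tb" is registered as a temp table; the array-index syntax follows HiveQL, so it may require going through a HiveContext:

    tb.registerTempTable("tb")

    // struct fields can be reached with dot notation; filterIp[0] picks the first array element
    sqlContext.sql("SELECT budget, filterIp[0] FROM tb")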

Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-08 Thread Xuelin Cao
Hi, I'm wondering whether there is an efficient way to continuously append new data to a registered Spark SQL table. This is what I want: I want to make an ad-hoc query service on top of a JSON-formatted system log. Of course, the system log is continuously generated. I will use spark
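
A sketch of one way this is commonly handled in Spark 1.x, with hypothetical file paths and table name; I'm not aware of an in-place append to a registered temp table, so the pattern is to union the new batch and re-register under the same name:

    // initial load (hypothetical paths)
    var logs = sqlContext.jsonFile("/logs/2014-12-08.json")
    logs.registerTempTable("syslog")

    // later, when a new chunk of the log arrives:
    val newBatch = sqlContext.jsonFile("/logs/2014-12-08-part2.json")
    logs = logs.unionAll(newBatch)
    logs.registerTempTable("syslog")   // re-register under the same name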