Why can't my SQL UDF be registered?

2014-12-15 Thread Xuelin Cao

Hi,
     I tried to create a function that converts a Unix timestamp to the
hour of the day.
      It works if the code is like this:

sqlContext.registerFunction("toHour", (x: Long) => { new java.util.Date(x * 1000).getHours })

      But, if I do it like this, it doesn't work:

def toHour(x: Long) = { new java.util.Date(x * 1000).getHours }
sqlContext.registerFunction("toHour", toHour)

      The system reports an error:

:23: error: missing arguments for method toHour;
follow this method with `_' if you want to treat it as a partially applied function
       sqlContext.registerFunction("toHour", toHour)

      Can anyone help with this error?
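
A minimal sketch of the usual fix, following the hint in the error message (this assumes the Spark 1.1/1.2-era registerFunction API): eta-expand the method into a function value with a trailing underscore, or define it as a function value in the first place.

// Option 1: eta-expand the method into a function value with `_`
def toHour(x: Long) = new java.util.Date(x * 1000).getHours
sqlContext.registerFunction("toHour", toHour _)

// Option 2: define it as a function value directly
val toHourFn = (x: Long) => new java.util.Date(x * 1000).getHours
sqlContext.registerFunction("toHour", toHourFn)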


When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao

Hi, 
     In the Spark SQL documentation, it says "Some of these (such as indexes) are 
less important due to Spark SQL’s in-memory  computational model. Others are 
slotted for future releases of Spark SQL.   
   - Block level bitmap indexes and virtual columns (used to build indexes)"

     For our use cases, a DB index is quite important. I have about 300G of data in
our database, and we always use "customer id" as a predicate for lookups.
Without an index, we have to scan all 300G of data, and a simple lookup takes
more than 1 minute, while MySQL takes only 10 seconds. We tried creating an
independent table for each "customer id"; the result is pretty good, but the
logic becomes very complex.
     I'm wondering when Spark SQL will support DB indexes, and until then, is
there an alternative way to get index-like functionality?
Thanks


Re: When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao
 Thanks. I haven't tried the partitioned table support (it sounds like a Hive feature).
Is there any guideline? Should I first use hiveContext to create a partitioned table?


 On Thursday, December 18, 2014 2:28 AM, Michael Armbrust wrote:

 - Dev list
Have you looked at partitioned table support?  That would only scan data where 
the predicate matches the partition.  Depending on the cardinality of the 
customerId column that could be a good option for you.
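
A hedged sketch of what that might look like with HiveContext in Spark 1.2 (the table and column names here are illustrative, not from this thread): create a Hive table partitioned by the lookup key, so a query filtering on that key only scans the matching partition.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc is the spark-shell SparkContext

// Partition the data by the lookup key so predicates on it prune whole partitions.
hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS orders_by_customer (ad_id INT, amount DOUBLE)
  PARTITIONED BY (customer_id INT)
  STORED AS PARQUET""")

// Load one partition (static partition insert); in practice this would be scripted.
hiveContext.sql("""
  INSERT OVERWRITE TABLE orders_by_customer PARTITION (customer_id = 10113000)
  SELECT ad_id, amount FROM raw_orders WHERE customer_id = 10113000""")

// Only the customer_id=10113000 partition is scanned here.
hiveContext.sql("SELECT ad_id, amount FROM orders_by_customer WHERE customer_id = 10113000").collect()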

Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Xuelin Cao

Hi,
       I'm testing the Parquet file format, and predicate pushdown is a very
useful feature for us.
       However, it looks like predicate pushdown doesn't work after I set

      sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

       Here is my SQL:

      sqlContext.sql("select adId, adTitle from ad where groupId=10113000").collect

       Then, I checked the amount of input data on the web UI. The amount of
input data is ALWAYS 80.2M, regardless of whether I turn the
spark.sql.parquet.filterPushdown flag on or off.
       I'm not sure whether there is anything I must do when generating the
Parquet file to make predicate pushdown available. (For an ORC file, when
creating it, I need to explicitly sort on the field that will be used for
predicate pushdown.)
       Does anyone have any idea?
       Also, does anyone know the internal mechanism of Parquet predicate pushdown?
       Thanks
 

Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Xuelin Cao

Hi, 
      Curiouser and curiouser. I'm puzzled by the Spark SQL cached table.
      Theoretically, the cached table should be stored in a columnar format, and a
query should only scan the columns referenced in my SQL.
      However, in my test, I always see the whole table being scanned even though
I only "select" one column in my SQL.
      Here is my code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
sqlContext.cacheTable("adTable")  //The table has > 10 columns
//First run, cache the table into memory
sqlContext.sql("select * from adTable").collect
//Second run, only one column is used. It should only scan a small fraction of data
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect
sqlContext.sql("select adId from adTable").collect

        What I found is that every time I run the SQL, the web UI shows the
amount of input data is always the same --- the total size of the table.
        Is anything wrong? My expectation is:
        1. The cached table is stored as a columnar table
        2. Since I only need one column in my SQL, the amount of input data
shown in the web UI should be very small
        But what I found is not the case at all. Why?
        Thanks


Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-07 Thread Xuelin Cao
Yes, the problem is, I've turned the flag on.

One possible reason: the Parquet file supports "predicate pushdown" by storing
statistical min/max values for each column in every Parquet block (row group).
If, in my test, the rows with "groupId=10113000" are scattered across all
Parquet blocks, then the predicate pushdown cannot skip anything.

But I'm not quite sure about that, and I don't know whether there is any other
reason that could lead to this.
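
If that is the cause, one hedged remedy (a sketch only, against the Spark 1.2 SchemaRDD API, with hypothetical paths) is to sort the data on the filter column before writing the Parquet file, so each row group covers a narrow groupId range and its min/max statistics become selective:

// Sort on the pushdown column before writing, so row-group min/max stats are tight.
val sorted = sqlContext.sql("SELECT adId, adTitle, groupId FROM ad ORDER BY groupId")
sorted.saveAsParquetFile("/data/ad_sorted.parquet")   // hypothetical output path

// Re-read and query with the pushdown flag on; only matching row groups should be read.
sqlContext.parquetFile("/data/ad_sorted.parquet").registerTempTable("ad_sorted")
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
sqlContext.sql("SELECT adId, adTitle FROM ad_sorted WHERE groupId = 10113000").collect()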


On Wed, Jan 7, 2015 at 10:14 PM, Cody Koeninger  wrote:

> But Xuelin already posted in the original message that the code was using
>
> SET spark.sql.parquet.filterPushdown=true
>
> On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv 
> wrote:
>
>> Quoting Michael:
>> Predicate push down into the input format is turned off by default
>> because there is a bug in the current parquet library that causes null
>> pointer exceptions when there are row groups that are entirely null.
>>
>> https://issues.apache.org/jira/browse/SPARK-4258
>>
>> You can turn it on if you want:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
>>
>> Daniel
>>
>


Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
Hi,

 Currently, we are building up a mid-scale Spark cluster (100 nodes)
in our company. One thing bothering us is how Spark manages resources
(CPU, memory).

 I know there are 3 resource management modes: standalone, Mesos, and YARN.

 In standalone mode, the cluster master simply allocates the resources
when the application is launched. Suppose an engineer launches a spark-shell,
claiming 100 CPU cores and 100G of memory, but does nothing with it. The
cluster master still allocates the resources to this app even though the
spark-shell is idle. This is definitely not what we want.

 What we want is for resources to be allocated when an actual task is
about to run. For example, in the map stage the app may need 100 cores
because the RDD has 100 partitions, while in the reduce stage only 20
cores are needed because the RDD is shuffled into 20 partitions.

 I'm not very clear about the granularity of Spark's resource
management. In standalone mode, resources are allocated when the app
is launched. What about Mesos and YARN? Can they support task-level
resource management?

 And, what is the recommended mode for resource management? (Mesos?
YARN?)

 Thanks


Re: Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
Hi,

 Thanks for the information.

 One more thing I want to clarify: when does Mesos or YARN allocate and
release resources? In other words, what is the resource lifetime?

 For example, in standalone mode, resources are allocated when the
application is launched and released when the application finishes.

 Then, it looks like in Mesos fine-grained mode, resources are allocated
when a task is about to run and released when the task finishes.

 How about Mesos coarse-grained mode and YARN mode? Are resources managed
at the job level, i.e., the resource lifetime equals the job lifetime? Or at
the stage level?

 One more question about Mesos fine-grained mode: how large is the overhead
of resource allocation and release? In MapReduce, a noticeable amount of time
is spent waiting for resource allocation. How does Mesos fine-grained mode
compare?



On Thu, Jan 8, 2015 at 3:07 PM, Tim Chen  wrote:

> Hi Xuelin,
>
> I can only speak about Mesos mode. There are two modes of management in
> Spark's Mesos scheduler, which are fine-grain mode and coarse-grain mode.
>
> In fine grain mode, each spark task launches one or more spark executors
> that only live through the life time of the task. So it's comparable to
> what you spoke about.
>
> In coarse grain mode it's going to support dynamic allocation of executors
> but that's being at a higher level than tasks.
>
> As for resource management recommendation, I think it's important to see
> what other applications you want to be running besides Spark in the same
> cluster and also your use cases, to see what resource management fits your
> need.
>
> Tim
>
>


Re: Can Spark support task-level resource management?

2015-01-07 Thread Xuelin Cao
Got it, thanks.


On Thu, Jan 8, 2015 at 3:30 PM, Tim Chen  wrote:

> In coarse grain mode, the spark executors are launched and kept running
> while the scheduler is running. So if you have a spark shell launched and
> remained open, the executors are running and won't finish until the shell
> is exited.
>
> In fine grain mode, the overhead time mostly comes from downloading the
> spark tar (if it's not already deployed in the slaves) and launching the
> spark executor. I suggest you try it out and look at the latency to see if
> it fits your use case or not.
>
> Tim
>
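
For reference, a minimal sketch of how the two Mesos modes are selected in Spark 1.2; the Mesos/ZooKeeper URL below is a placeholder:

import org.apache.spark.SparkConf

// Fine-grained mode (the Spark-on-Mesos default in 1.2): CPU cores are acquired
// per Spark task and handed back to Mesos when the task finishes.
val fineGrained = new SparkConf()
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")   // placeholder Mesos/ZK URL
  .set("spark.mesos.coarse", "false")

// Coarse-grained mode: long-lived executors hold their cores for the lifetime of
// the application (e.g. an open spark-shell); spark.cores.max caps what it may hold.
val coarseGrained = new SparkConf()
  .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "20")

// The same properties can be passed to spark-submit / spark-shell with --conf.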


Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi, Cheng

 I checked the Input data for each stage. For example, in my attached
screen snapshot, the input data is 1212.5MB, which is the total size of
the whole table.

[image: Inline image 1]

 I also checked the input data for each task (on the stage detail
page), and the sum of the input data across tasks is also 1212.5MB.




On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian  wrote:

>  Hey Xuelin, which data item in the Web UI did you check?
>
>


Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi, Cheng

  In your code:

cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()

 When the first sql runs, the whole table is not cached yet, so the input
data is the original json file.
 After the table is cached, the json-format data is no longer read, so the
total amount of input data also drops.

 If you try it like this instead:

cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()
sql("select * from tbl").collect()

 Is the input data of the 3rd SQL bigger than 49.1KB?
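
For reference, a small spark-shell sketch of the comparison being discussed (assuming Spark 1.2 and the hypothetical /tmp/p.json file from the quoted message below): materialize the cache first, then compare the Input sizes the web UI reports for a full-row scan versus a single-column scan.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)          // sc comes from the spark-shell
import sqlContext._

jsonFile("file:///tmp/p.json").registerTempTable("tbl")   // hypothetical sample file
cacheTable("tbl")

// First scan reads the JSON file and builds the in-memory columnar cache.
sql("select * from tbl").collect()

// Subsequent scans hit the columnar cache; a single-column query should show
// a much smaller Input size in the web UI than a select * over the same cache.
sql("select * from tbl").collect()
sql("select name from tbl").collect()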




On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian  wrote:

>  Weird, which version did you use? Just tried a small snippet in Spark
> 1.2.0 shell as follows, the result showed in the web UI meets the
> expectation quite well:
>
> import org.apache.spark.sql.SQLContext
> import sc._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
>
> jsonFile("file:///tmp/p.json").registerTempTable("tbl")
> cacheTable("tbl")
> sql("select * from tbl").collect()
> sql("select name from tbl").collect()
>
> The input data of the first statement is 292KB, the second is 49.1KB.
>
> The JSON file I used is examples/src/main/resources/people.json, I copied
> its contents multiple times to generate a larger file.
>
> Cheng
>
>


Did anyone try overcommitting CPU cores?

2015-01-08 Thread Xuelin Cao
Hi,

  I'm wondering whether it is a good idea to overcommit CPU cores on
the Spark cluster.

  For example, in our testing cluster, each worker machine has 24
physical CPU cores. However, we can set the CPU core number to 48 or more
in the Spark configuration file. As a result, we can launch more tasks
than the number of physical CPU cores.

  The motivation for overcommitting CPU cores is that, much of the time, a
task cannot consume 100% of a single CPU core (due to I/O, shuffle, etc.).

  So, overcommitting CPU cores allows more tasks to run at the same time
and uses the resources more economically.

  But is there any reason we should not do this? Has anyone tried it?
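
For reference, a minimal sketch of the kind of overcommit described above, assuming standalone mode where each worker advertises its slot count in conf/spark-env.sh (the values are illustrative):

# conf/spark-env.sh on each worker (standalone mode), illustrative values.
# Advertise twice the physical core count, so the master can schedule up to
# 48 concurrent task slots on this 24-core machine.
SPARK_WORKER_CORES=48
# Note: memory is not overcommitted here; only the CPU slot count is inflated.
SPARK_WORKER_MEMORY=64g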



Re: Did anyone try overcommitting CPU cores?

2015-01-09 Thread Xuelin Cao
Thanks, but how do I increase the number of tasks per core?

For example, if the application claims 10 cores, is it possible to launch
100 tasks concurrently?



On Fri, Jan 9, 2015 at 2:57 PM, Jörn Franke  wrote:

> Hallo,
>
> Based on experiences with other software in virtualized environments I
> cannot really recommend this. However, I am not sure how Spark reacts. You
> may face unpredictable task failures depending on utilization, tasks
> connecting to external systems (databases etc.) may fail unexpectedly and
> this might be a problem for them (transactions not finishing etc.).
>
> Why not increase the tasks per core?
>
> Best regards


How to create an empty RDD with a given type?

2015-01-12 Thread Xuelin Cao
Hi,

I'd like to create a transform function that converts an RDD[String] to an
RDD[Int].

Occasionally, the input RDD could be empty. In that case, I just want to
directly create an empty RDD[Int]; I don't want to return None as the
result.

Is there an easy way to do that?


Re: How to create an empty RDD with a given type?

2015-01-12 Thread Xuelin Cao
Got it, thanks!

On Tue, Jan 13, 2015 at 2:00 PM, Justin Yip  wrote:

> Xuelin,
>
> There is a function called emptyRDD on SparkContext
> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext>
> which serves this purpose.
>
> Justin
>
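
A minimal sketch of that approach (the convert function and its parsing logic are illustrative, not from the thread):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Returns an RDD[Int]; if the input has no elements, hand back an explicitly
// typed empty RDD instead of an Option/None.
def convert(sc: SparkContext, input: RDD[String]): RDD[Int] =
  if (input.take(1).isEmpty) sc.emptyRDD[Int]
  else input.map(_.trim.toInt)   // hypothetical parsing; adapt to the real format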


IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi,

  I'm trying to migrate some Hive scripts to Spark SQL. However, I
found that some statements are incompatible with Spark SQL.

  Here is my SQL. The same SQL works fine in the Hive environment.

SELECT
  if(ad_user_id>1000, 1000, ad_user_id) as user_id
FROM
  ad_search_keywords

 What I found is that the parser reports an error on the "if" expression:

No function to evaluate expression. type: AttributeReference, tree:
ad_user_id#4


 Does anyone have any idea about this?


Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi, I'm using Spark 1.2


On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan 
wrote:

>  Hi Xuelin,
>
>
>
> What version of Spark are you using?
>
>
>
> Thanks,
>
> Daoyuan
>
>
>


Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Xuelin Cao
Hi,

 Yes, this is what I'm doing. I'm using hiveContext.hql() to run my
query.

  But, the problem still happens.



On Tue, Jan 20, 2015 at 7:24 PM, DEVAN M.S.  wrote:

> Add one more library
>
> libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> Replace sqlContext with hiveContext. It's working when using HiveContext
> for me.
>
>
>
> Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
> VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
>
>
> On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S.  wrote:
>
>> Which context are you using, HiveContext or SQLContext? Can you try with
>> HiveContext?
>>
>>
>> Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
>> VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
>>
>>
>


In HA master mode, how do we identify the alive master?

2015-03-04 Thread Xuelin Cao
Hi,

  In our project, we use standalone dual masters + ZooKeeper to provide
HA for the Spark master.

  Now the problem is, how do we know which master is the currently alive
master?

  We tried to read the info that the master stores in ZooKeeper, but we
found no information there that identifies the "current alive master".

  Any suggestions for us?

Thanks
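
One hedged option (an untested sketch; it assumes the standalone Master web UI serves its state, including a status field, at /json, and the host names are placeholders): poll each master's web UI and pick the one that reports ALIVE.

import scala.io.Source

// Placeholder master host names; 8080 is the default standalone web UI port.
val masters = Seq("master1", "master2")

// Crude check: fetch each master's /json status page and look for ALIVE.
// (Assumption: the standalone Master exposes its state at
// http://<host>:8080/json/ and includes a status field; verify this
// against your Spark build before relying on it.)
val alive = masters.find { host =>
  try {
    Source.fromURL(s"http://$host:8080/json/").mkString.contains("ALIVE")
  } catch {
    case _: Exception => false   // master unreachable
  }
}

println(alive.getOrElse("no ALIVE master found"))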


Why does the Spark master consume 100% CPU when we kill a Spark streaming app?

2015-03-10 Thread Xuelin Cao
Hey,

 Recently, we found in our cluster that when we kill a Spark streaming
app, the whole cluster cannot respond for 10 minutes.

 We investigated the master node and found that the master process
consumes 100% CPU when we kill the Spark streaming app.

 How could this happen? Has anyone had a similar problem before?


Is there a way to turn on spark eventLog on the worker node?

2014-11-21 Thread Xuelin Cao

Hi,

I'm going to debug some spark applications on our testing platform, and
it would be helpful if we could see the eventLog on the worker node.

I've tried to turn on spark.eventLog.enabled and set the
spark.eventLog.dir parameter on the worker node. However, it doesn't
work.

I do have event logs on my driver node, and I know how to turn that on.
However, the same settings don't work on the worker node.

Can anyone help me clarify whether the event log is only available on the
driver node?
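
For reference, a sketch of how these settings are normally applied on the application (driver) side, since the event log is written by the driver's SparkContext rather than by the workers (paths below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Event logging is an application-level setting: the driver writes the log,
// so configure it where the app is created/submitted, not on the workers.
val conf = new SparkConf()
  .setAppName("event-log-demo")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")   // placeholder shared path

val sc = new SparkContext(conf)

// The same two properties can also be put into conf/spark-defaults.conf on the
// machine the application is submitted from.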




Is it possible to just change the value of the items in RDD without making a full copy?

2014-12-02 Thread Xuelin Cao

Hi, 
     I'd like to perform an operation on an RDD that ONLY changes the value of
some items, without making a full copy or a full scan of the data.
     This is useful when I need to handle a large RDD where each time I only need
to change a small fraction of the data and keep the rest unchanged. Certainly I
don't want to make a full copy of the data into the new RDD.
     For example, suppose I have an RDD that contains the integers from 0 to
100. What I want is to change the first element of the RDD from 0 to 1 and leave
the other elements untouched.

     I tried this, but it doesn't work:

var rdd = parallelize(Range(0, 100))
rdd.mapPartitions({iter => iter(0) = 1})

     The reported error is:

value update is not a member of Iterator[Int]

     Does anyone know how to make it work?
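
A hedged sketch of one way to get close to this: RDDs are immutable, so a new RDD is unavoidable, but a lazy, narrow transformation avoids shuffling the data or materializing a second copy up front.

val rdd = sc.parallelize(0 until 100)

// Rewrite only the first element of the first partition; every other element
// is passed through untouched. The map is lazy and per-partition, so no extra
// copy of the dataset is materialized up front.
val updated = rdd.mapPartitionsWithIndex { (partIdx, iter) =>
  if (partIdx == 0) {
    iter.zipWithIndex.map { case (value, i) => if (i == 0) 1 else value }
  } else {
    iter
  }
}

updated.take(5)   // Array(1, 1, 2, 3, 4) for this example input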


Spark SQL: How to get the hierarchical element with SQL?

2014-12-07 Thread Xuelin Cao

Hi,
    I'm generating a Spark SQL table from an offline JSON file.
    The difficulty is that the original JSON file has a hierarchical
structure. As a result, this is what I get:

scala> tb.printSchema
root
 |-- budget: double (nullable = true)
 |-- filterIp: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- status: integer (nullable = true)
 |-- third_party: integer (nullable = true)
 |-- userId: integer (nullable = true)

As you may have noticed, the table schema has a hierarchical structure
(the "element" field is a sub-field under the "filterIp" field). My question
is, how do I access the "element" field with SQL?
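
A hedged sketch of two common ways to reach into an array column such as filterIp, shown with HiveContext (whose HiveQL parser supports both forms); the file path is hypothetical:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.jsonFile("/data/config.json").registerTempTable("tb")   // hypothetical path

// 1. Index into the array directly (first IP in the list):
hiveContext.sql("SELECT userId, filterIp[0] FROM tb").collect()

// 2. Flatten the array with LATERAL VIEW explode, one row per element:
hiveContext.sql("""
  SELECT userId, ip
  FROM tb LATERAL VIEW explode(filterIp) ips AS ip""").collect()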




Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-08 Thread Xuelin Cao

Hi,
      I'm wondering whether there is an efficient way to continuously append
new data to a registered Spark SQL table.
      This is what I want: I want to build an ad-hoc query service over a
JSON-formatted system log. The system log is, of course, continuously generated.
I will use Spark Streaming to consume the system log as my input, and I want to
find a way to efficiently append the new data to an existing Spark SQL table.
Furthermore, I want the whole table cached in memory/Tachyon.
      It looks like Spark SQL supports the "INSERT" method, but only for
Parquet files. In addition, it is inefficient to insert a single row at a time.
      I do know that somebody has built a system similar to what I want (an ad-hoc
query service over a growing system log), so there must be an efficient way. Does
anyone know?
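
For reference, one possible (and admittedly naive) pattern sketched against the Spark 1.2 APIs: union each micro-batch into an accumulated SchemaRDD and re-register/re-cache it. The stream source, host, and table name are placeholders, and re-caching a growing union on every batch is costly, so treat this purely as a starting point.

import org.apache.spark.sql.{SQLContext, SchemaRDD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sqlContext = new SQLContext(sc)              // sc from the spark-shell
val ssc = new StreamingContext(sc, Seconds(10))

// Hypothetical source of JSON log lines.
val lines = ssc.socketTextStream("loghost", 9999)

var logTable: SchemaRDD = null                   // accumulated table so far

lines.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {                    // skip empty micro-batches
    val batch = sqlContext.jsonRDD(rdd)          // infer schema for this batch
    logTable = if (logTable == null) batch else logTable.unionAll(batch)
    logTable.registerTempTable("syslog")
    sqlContext.cacheTable("syslog")              // naive: re-caches the whole union
  }
}

ssc.start()
// Elsewhere, ad-hoc queries run against the latest registered snapshot:
// sqlContext.sql("SELECT count(*) FROM syslog WHERE level = 'ERROR'")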