Re: [SPAM] Customized Aggregation Query on Spark SQL

2015-04-30 Thread Wenlei Xie
Hi Zhan,

How would this be achieved? Should the data be partitioned by name in this
case?
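
To make sure I understand the suggestion, is the idea something like the rough
Scala sketch below? (Person and the field names are just placeholders for my
actual record type.)

// Person and the field names are placeholders for the actual record type.
case class Person(name: String, age: Int, other: String)

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// Step 1: within each partition, keep only the max-age record per name,
// so that much less data is shuffled.
val locallyReduced = personRDD.mapPartitions { iter =>
  val best = scala.collection.mutable.HashMap.empty[String, Person]
  iter.foreach { p =>
    best.get(p.name) match {
      case Some(q) if q.age >= p.age => // keep the record we already have
      case _                         => best(p.name) = p
    }
  }
  best.iterator
}

// Step 2: reduceByKey across partitions to get the global max per name.
val result = locallyReduced.reduceByKey((a, b) => if (a.age > b.age) a else b)

(If I understand correctly, reduceByKey already does a map-side combine, so
perhaps the explicit mapPartitions step just makes that local aggregation
visible?)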

Thank you!

Best,
Wenlei

On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang  wrote:

>  One optimization is to reduce the shuffle by first aggregating locally
> (keeping only the max age for each name) and then calling reduceByKey.
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Apr 24, 2015, at 10:03 PM, ayan guha  wrote:
>
>  Here you go
>
>  t = [["A",10,"A10"],["A",20,"A20"],["A",30,"A30"],
>       ["B",15,"B15"],["C",10,"C10"],["C",20,"C200"]]
> TRDD = sc.parallelize(t).map(lambda t:
>     Row(name=str(t[0]), age=int(t[1]), other=str(t[2])))
> TDF = ssc.createDataFrame(TRDD)
> TDF.printSchema()
> TDF.registerTempTable("tab")
> JN = ssc.sql("select t.name, t.age, t.other from tab t inner join "
>              "(select name, max(age) age from tab group by name) t1 "
>              "on t.name = t1.name and t.age = t1.age")
> for i in JN.collect():
>     print i
>
>  Result:
>  Row(name=u'A', age=30, other=u'A30')
> Row(name=u'B', age=15, other=u'B15')
> Row(name=u'C', age=20, other=u'C200')
>
> On Sat, Apr 25, 2015 at 2:48 PM, Wenlei Xie  wrote:
>
>> Sure. A simple example of data would be (there might be many other
>> columns)
>>
>>  Name  Age  Other
>> A     10   A10
>> A     20   A20
>> A     30   A30
>> B     15   B15
>> C     10   C10
>> C     20   C20
>>
>>  The desired output would be
>> Name  Age  Other
>> A     30   A30
>> B     15   B15
>> C     20   C20
>>
>>  Thank you so much for the help!
>>
>> On Sat, Apr 25, 2015 at 12:41 AM, ayan guha  wrote:
>>
>>> Can you give an example set of data and the desired output?
>>>
>>> On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie 
>>> wrote:
>>>
>>>>  Hi,
>>>>
>>>>  I would like to answer the following customized aggregation query on
>>>> Spark SQL
>>>> 1. Group the table by the value of Name
>>>> 2. For each group, choose the tuple with the max value of Age (the ages
>>>> are distinct for every name)
>>>>
>>>>  I am wondering what's the best way to do this in Spark SQL? Should I
>>>> use a UDAF? Previously I was doing something like the following with plain Spark:
>>>>
>>>>  personRDD.map(t => (t.name, t))
>>>> .reduceByKey((a, b) => if (a.age > b.age) a else b)
>>>>
>>>>  Thank you!
>>>>
>>>>  Best,
>>>> Wenlei
>>>>
>>>
>>>
>>>
>>>  --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>>   --
>>  Wenlei Xie (谢文磊)
>>
>> Ph.D. Candidate
>> Department of Computer Science
>> 456 Gates Hall, Cornell University
>> Ithaca, NY 14853, USA
>> Email: wenlei@gmail.com
>>
>
>
>
>  --
> Best Regards,
> Ayan Guha
>
>
>


-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com


Re: Super slow caching in 1.3?

2015-04-27 Thread Wenlei Xie
I face a similar issue in Spark 1.2. Caching the schema RDD takes about 50 s
for 400 MB of data. The schema is similar to the TPC-H LineItem table.

Here is the code I used to cache the table. I am wondering if there is any
setting I am missing?

Thank you so much!

// Register the SchemaRDD as a temp table and ask the SQL context to cache it.
lineitemSchemaRDD.registerTempTable("lineitem");
// cacheTable is lazy; the in-memory cache is only built on the first action.
sqlContext.sqlContext().cacheTable("lineitem");
System.out.println(lineitemSchemaRDD.count());  // this count materializes the cache
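
In case it is useful, here is a rough Scala sketch of how I would try to
separate the cache-build time from everything else (same table name as above):

// cacheTable is lazy, so the in-memory columnar cache is only built by the
// first action that touches the table.
sqlContext.cacheTable("lineitem")

var start = System.nanoTime()
println(lineitemSchemaRDD.count())                  // builds the cache
println("first count:  " + (System.nanoTime() - start) / 1e9 + " s")

start = System.nanoTime()
println(lineitemSchemaRDD.count())                  // should now hit the cache
println("second count: " + (System.nanoTime() - start) / 1e9 + " s")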


On Mon, Apr 6, 2015 at 8:00 PM, Christian Perez  wrote:

> Hi all,
>
> Has anyone else noticed very slow caching of Parquet files? It
> takes 14 s per 235 MB (1 block) for an uncompressed, node-local Parquet file
> on M2 EC2 instances. Or are my expectations way off...
>
> Cheers,
>
> Christian
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com


Automatic Cache in SparkSQL

2015-04-27 Thread Wenlei Xie
Hi,

I am trying to run a simple query with SparkSQL over a Parquet file. When I
execute the query several times, the first run takes about 2 s while the
later runs take <0.1 s.

Looking at the log file, it seems the later runs don't load the data from
disk. However, I didn't enable any cache explicitly. Is there any automatic
cache used by SparkSQL? Is there any way to check this?
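
For example, would checking SQLContext.isCached be the right way to rule out
an explicit in-memory cache? A rough Scala sketch, with a placeholder table
name:

// isCached only reports tables cached via cacheTable / CACHE TABLE, so a
// false here would suggest the speed-up comes from somewhere else (perhaps
// the OS file cache) rather than from SparkSQL's in-memory columnar cache.
println(sqlContext.isCached("mytable"))   // "mytable" is a placeholder name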

Thank you!

Best,
Wenlei


Understand the running time of SparkSQL queries

2015-04-26 Thread Wenlei Xie
Hi,

I am wondering how we should understand the running time of SparkSQL
queries? For example, how can I see the physical query plan and the running
time of each stage? Is there any guide that talks about this?
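
For example, is something like the following rough Scala sketch (with a
placeholder query and table name) the intended way to look at the plans? The
per-stage running times I can see in the web UI on port 4040, but I am not
sure how to map the stages back to the operators in the physical plan.

val q = sqlContext.sql("SELECT name, MAX(age) FROM person GROUP BY name")

// queryExecution shows the parsed, analyzed, and optimized logical plans as
// well as the chosen physical plan.
println(q.queryExecution)

// Running the query then produces the stage-level timings in the web UI.
q.collect()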

Thank you!

Best,
Wenlei


Customized Aggregation Query on Spark SQL

2015-04-24 Thread Wenlei Xie
Hi,

I would like to answer the following customized aggregation query on Spark
SQL
1. Group the table by the value of Name
2. For each group, choose the tuple with the max value of Age (the ages are
distinct for every name)

I am wondering what's the best way to do this in Spark SQL? Should I use a
UDAF? Previously I was doing something like the following with plain Spark:

personRDD.map(t => (t.name, t))
.reduceByKey((a, b) => if (a.age > b.age) a else b)
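
For concreteness, one SQL formulation I have considered is a self-join
against the per-name max age -- a rough Scala sketch, assuming the data is
registered as a temp table named "person":

val result = sqlContext.sql(
  "select p.name, p.age, p.other from person p inner join " +
  "(select name, max(age) age from person group by name) m " +
  "on p.name = m.name and p.age = m.age")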

Thank you!

Best,
Wenlei


Re: Number of input partitions in SparkContext.sequenceFile

2015-04-24 Thread Wenlei Xie
Hi,

I checked the number of partitions by

System.out.println("INFO: RDD with " + rdd.partitions().size() + "
partitions created.");


Each split is about 100 MB. I am currently loading the data from the local
file system; would this explain the observation?
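
For reference, this is roughly how I load the file (placeholder path and
value type). My understanding is that minPartitions is only a lower bound and
the actual count comes from the Hadoop input splits; for the local file
system the default block/split size is fairly small (around 32 MB, if I
remember correctly), which would cut each ~100 MB chunk into three or four
splits and could explain seeing 16 partitions instead of 4.

import org.apache.spark.SparkContext._   // Writable converters for sequenceFile

val rdd = sc.sequenceFile[String, String]("/tmp/input.seq", minPartitions = 4)
println("INFO: RDD with " + rdd.partitions.size + " partitions created.")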

Thank you!

Best,
Wenlei

On Tue, Apr 21, 2015 at 6:28 AM, Archit Thakur 
wrote:

> Hi,
>
> It should generate the same number of partitions as the number of splits.
> How did you check the number of partitions? Also, please paste your file size
> and your hdfs-site.xml and mapred-site.xml here.
>
> Thanks and Regards,
> Archit Thakur.
>
> On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie  wrote:
>
>> Hi,
>>
>> I am wondering what mechanism determines the number of partitions
>> created by SparkContext.sequenceFile?
>>
>> For example, although my file has only 4 splits, Spark would create 16
>> partitions for it. Is it determined by the file size? Is there any way to
>> control it? (Looks like I can only tune minPartitions but not maxPartitions)
>>
>> Thank you!
>>
>> Best,
>> Wenlei
>>
>>
>>
>


-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com


Re: Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Using an Object[] in Java just works :) (i.e., pass the array rather than the
List itself).

On Fri, Apr 24, 2015 at 4:56 PM, Wenlei Xie  wrote:

> Hi,
>
> I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java
> by using a List? It looks like
>
> ArrayList something;
> Row.create(something)
>
> will create a row with a single column (and that single column contains the
> whole list).
>
> Best,
> Wenlei
>
>
>


-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com


Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Hi,

I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java
by using a List? It looks like

ArrayList something;
Row.create(something)

will create a row with a single column (and that single column contains the
whole list).

Best,
Wenlei


Number of input partitions in SparkContext.sequenceFile

2015-04-18 Thread Wenlei Xie
Hi,

I am wondering what mechanism determines the number of partitions
created by SparkContext.sequenceFile?

For example, although my file has only 4 splits, Spark would create 16
partitions for it. Is it determined by the file size? Is there any way to
control it? (Looks like I can only tune minPartitions but not maxPartitions)

Thank you!

Best,
Wenlei


CPU Usage for Spark Local Mode

2015-04-04 Thread Wenlei Xie
Hi,

I am currently testing my application with Spark in local mode, with the
master set to local[4]. One thing I notice is that when a groupBy/reduceByKey
operation is involved, the CPU usage can sometimes be around 600% to 800%. I
am wondering if this is expected? (With only 4 worker threads assigned, plus
the driver thread, shouldn't it be at most about 500%?)

Best,
Wenlei