Re: Customized Aggregation Query on Spark SQL
Hi Zhan,

How would this be achieved? Should the data be partitioned by name in this case?

Thank you!

Best,
Wenlei

On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang wrote:
> One optimization is to reduce the shuffle by first aggregating locally
> (only keep the max for each name), and then reduceByKey.
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 24, 2015, at 10:03 PM, ayan guha wrote:
>
> Here you go:
>
> t = [["A",10,"A10"], ["A",20,"A20"], ["A",30,"A30"],
>      ["B",15,"B15"], ["C",10,"C10"], ["C",20,"C200"]]
> TRDD = sc.parallelize(t).map(lambda t:
>     Row(name=str(t[0]), age=int(t[1]), other=str(t[2])))
> TDF = ssc.createDataFrame(TRDD)
> TDF.printSchema()
> TDF.registerTempTable("tab")
> JN = ssc.sql("select t.name, t.age, t.other from tab t inner join "
>              "(select name, max(age) age from tab group by name) t1 "
>              "on t.name = t1.name and t.age = t1.age")
> for i in JN.collect():
>     print i
>
> Result:
> Row(name=u'A', age=30, other=u'A30')
> Row(name=u'B', age=15, other=u'B15')
> Row(name=u'C', age=20, other=u'C200')
>
> On Sat, Apr 25, 2015 at 2:48 PM, Wenlei Xie wrote:
>
>> Sure. A simple example of the data would be (there might be many other
>> columns):
>>
>> Name  Age  Other
>> A     10   A10
>> A     20   A20
>> A     30   A30
>> B     15   B15
>> C     10   C10
>> C     20   C20
>>
>> The desired output would be:
>>
>> Name  Age  Other
>> A     30   A30
>> B     15   B15
>> C     20   C20
>>
>> Thank you so much for the help!
>>
>> On Sat, Apr 25, 2015 at 12:41 AM, ayan guha wrote:
>>
>>> Can you give an example set of data and the desired output?
>>>
>>> On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to answer the following customized aggregation query on
>>>> Spark SQL:
>>>> 1. Group the table by the value of Name.
>>>> 2. For each group, choose the tuple with the max value of Age (the ages
>>>> are distinct for every name).
>>>>
>>>> I am wondering what's the best way to do it in Spark SQL? Should I
>>>> use a UDAF? Previously I was doing something like the following in Spark:
>>>>
>>>> personRDD.map(t => (t.name, t))
>>>>          .reduceByKey((a, b) => if (a.age > b.age) a else b)
>>>>
>>>> Thank you!
>>>>
>>>> Best,
>>>> Wenlei
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>> --
>> Wenlei Xie (谢文磊)
>> Ph.D. Candidate
>> Department of Computer Science
>> 456 Gates Hall, Cornell University
>> Ithaca, NY 14853, USA
>> Email: wenlei@gmail.com
>
> --
> Best Regards,
> Ayan Guha

--
Wenlei Xie (谢文磊)
Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com
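Zhan's suggestion (aggregate locally before shuffling) is what reduceByKey already provides through its map-side combine: the combine function runs on each partition first, so only one row per name crosses the shuffle boundary. The merge logic can be sketched in plain Python, independent of Spark (names and data here are just the thread's toy example):

```python
# Toy rows: (name, age, other) -- the same data as in the thread above.
rows = [("A", 10, "A10"), ("A", 20, "A20"), ("A", 30, "A30"),
        ("B", 15, "B15"), ("C", 10, "C10"), ("C", 20, "C20")]

def keep_older(a, b):
    """The combine function given to reduceByKey: keep the row with the max age."""
    return a if a[1] > b[1] else b

# Simulate reduceByKey: fold each row into a per-key accumulator.
# Spark applies the same function within each partition first (map-side
# combine) and again after the shuffle, which is why it must be associative.
best = {}
for row in rows:
    name = row[0]
    best[name] = keep_older(best[name], row) if name in best else row

print(sorted(best.values()))
# [('A', 30, 'A30'), ('B', 15, 'B15'), ('C', 20, 'C20')]
```

Because the same function is applied locally and after the shuffle, no extra partitioning by name is required for correctness; partitioning only changes where the final merge happens.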
Re: Super slow caching in 1.3?
I face a similar issue in Spark 1.2: caching the schema RDD takes about 50 s for 400 MB of data. The schema is similar to the TPC-H LineItem table. Here is the code I use for caching. I am wondering if there is any setting I am missing? Thank you so much!

lineitemSchemaRDD.registerTempTable("lineitem");
sqlContext.sqlContext().cacheTable("lineitem");
System.out.println(lineitemSchemaRDD.count());

On Mon, Apr 6, 2015 at 8:00 PM, Christian Perez wrote:
> Hi all,
>
> Has anyone else noticed very slow time to cache a Parquet file? It
> takes 14 s per 235 MB (1 block) uncompressed node-local Parquet file
> on M2 EC2 instances. Or are my expectations way off...
>
> Cheers,
>
> Christian
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
>
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Wenlei Xie (谢文磊)
Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com
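One thing worth noting when timing code like the above: cacheTable is lazy, so the first action (the count() here) pays both the scan and the cost of building the in-memory columnar cache; only later actions read from the cache. A toy Python model of that behavior (the class and names are illustrative, not Spark API):

```python
class LazilyCachedTable:
    """Toy model of cacheTable(): nothing is materialized until an action runs."""
    def __init__(self, load_fn):
        self._load = load_fn   # simulates scanning the source data
        self._cache = None     # filled in on the first action only

    def count(self):
        if self._cache is None:
            self._cache = list(self._load())  # first action pays the caching cost
        return len(self._cache)

loads = []
table = LazilyCachedTable(lambda: loads.append(1) or range(1000))

table.count()      # triggers the scan and builds the cache
table.count()      # served from the cache; no second scan
print(len(loads))  # the source was scanned exactly once
```

So a fair comparison times the second count() against the first: the difference is the one-time cache-build cost, which is what the numbers in this thread are really measuring.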
Automatic Cache in SparkSQL
Hi,

I am trying to answer a simple query with SparkSQL over a Parquet file. When I execute the query several times, the first run takes about 2 s while later runs take <0.1 s. From the log file it seems the later runs don't load the data from disk. However, I didn't enable any cache explicitly.

Is there any automatic cache used by SparkSQL? Is there any way to check this?

Thank you!

Best,
Wenlei
Understand the running time of SparkSQL queries
Hi,

I am wondering how we should understand the running time of SparkSQL queries: for example, the physical query plan and the running time of each stage. Is there any guide that covers this?

Thank you!

Best,
Wenlei
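Two places to look: the query's explain() output prints the physical plan, and the Spark web UI (default port 4040) breaks each job into stages with per-task and per-stage times. For coarse end-to-end numbers you can also time each action yourself; remember that transformations are lazy, so only actions have a measurable running time. A small timing helper (pure Python, shown without Spark so it is self-contained):

```python
import time

def timed(label, action):
    """Run an action (e.g. a lambda wrapping df.count()) and report wall time."""
    start = time.time()
    result = action()
    elapsed = time.time() - start
    print("%s took %.3f s" % (label, elapsed))
    return result

# With Spark this would be e.g. timed("count", lambda: df.count());
# here a plain callable stands in for the action:
total = timed("sum", lambda: sum(range(1000000)))
```

Timing the same action twice also separates one-time costs (scheduling, reading from disk, populating caches) from the steady-state running time.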
Customized Aggregation Query on Spark SQL
Hi,

I would like to answer the following customized aggregation query on Spark SQL:
1. Group the table by the value of Name.
2. For each group, choose the tuple with the max value of Age (the ages are distinct for every name).

I am wondering what's the best way to do it in Spark SQL? Should I use a UDAF? Previously I was doing something like the following in Spark:

personRDD.map(t => (t.name, t))
         .reduceByKey((a, b) => if (a.age > b.age) a else b)

Thank you!

Best,
Wenlei
Re: Number of input partitions in SparkContext.sequenceFile
Hi,

I checked the number of partitions with:

System.out.println("INFO: RDD with " + rdd.partitions().size() + " partitions created.");

Each single split is about 100 MB. I am currently loading the data from the local file system; would this explain the observation? Thank you!

Best,
Wenlei

On Tue, Apr 21, 2015 at 6:28 AM, Archit Thakur wrote:
> Hi,
>
> It should generate the same no. of partitions as the no. of splits.
> How'd you check the no. of partitions? Also, please paste your file size
> and hdfs-site.xml and mapred-site.xml here.
>
> Thanks and Regards,
> Archit Thakur.
>
> On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wrote:
>
>> Hi,
>>
>> I am wondering about the mechanism that determines the number of
>> partitions created by SparkContext.sequenceFile?
>>
>> For example, although my file has only 4 splits, Spark would create 16
>> partitions for it. Is it determined by the file size? Is there any way to
>> control it? (It looks like I can only tune minPartitions but not
>> maxPartitions.)
>>
>> Thank you!
>>
>> Best,
>> Wenlei

--
Wenlei Xie (谢文磊)
Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com
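For reference, the old Hadoop FileInputFormat (which sequenceFile uses) computes its split size roughly as max(minSize, min(goalSize, blockSize)), where goalSize = totalSize / minPartitions. On the local filesystem the default block size is much smaller than on HDFS (32 MB vs. the usual 128 MB), which is one plausible reason a file with a handful of HDFS-sized splits comes apart into many more partitions when read locally. A sketch of that arithmetic (plain Python; the block-size defaults are assumptions to check against your configuration):

```python
import math

MB = 1024 * 1024

def num_splits(total_size, min_partitions, block_size, min_split_size=1):
    """Approximation of the old FileInputFormat.getSplits() arithmetic."""
    goal = total_size // max(min_partitions, 1)
    split = max(min_split_size, min(goal, block_size))
    return math.ceil(total_size / split)

# A 400 MB file with Spark's default minPartitions of 2:
print(num_splits(400 * MB, 2, 128 * MB))  # 4 splits with a 128 MB HDFS block
print(num_splits(400 * MB, 2, 32 * MB))   # 13 splits with the 32 MB local-fs block
```

Under this model minPartitions can only push the split size down (more partitions), never up, which matches the observation that there is no maxPartitions knob; to get fewer partitions you would coalesce() after loading.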
Re: Creating a Row in SparkSQL 1.2 from ArrayList
Using Object[] in Java just works :).

On Fri, Apr 24, 2015 at 4:56 PM, Wenlei Xie wrote:
> Hi,
>
> I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java
> by using a List? It looks like
>
> ArrayList<Object> something;
> Row.create(something)
>
> will create a row with a single column (and the single column contains the
> list).
>
> Best,
> Wenlei

--
Wenlei Xie (谢文磊)
Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei@gmail.com
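The Object[] fix works because Row.create is a varargs method: an Object[] spreads into one argument per column, while an ArrayList arrives as a single argument and therefore a single column. The same distinction exists in Python's argument unpacking, sketched here with a stand-in factory (not the Spark API):

```python
def create(*columns):
    """Stand-in for a varargs row factory like Row.create(Object...)."""
    return tuple(columns)

values = ["a", 1, True]

one_column = create(values)     # the whole list is passed as a single column
many_columns = create(*values)  # unpacked: one column per element

print(one_column)    # (['a', 1, True],)
print(many_columns)  # ('a', 1, True)
```

So in either language the question is whether the collection is passed as one value or spread into the parameter list; converting the List to an array (Java) or unpacking it (Python) selects the latter.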
Creating a Row in SparkSQL 1.2 from ArrayList
Hi,

I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java by using a List? It looks like

ArrayList<Object> something;
Row.create(something)

will create a row with a single column (and the single column contains the list).

Best,
Wenlei
Number of input partitions in SparkContext.sequenceFile
Hi,

I am wondering about the mechanism that determines the number of partitions created by SparkContext.sequenceFile?

For example, although my file has only 4 splits, Spark would create 16 partitions for it. Is it determined by the file size? Is there any way to control it? (It looks like I can only tune minPartitions but not maxPartitions.)

Thank you!

Best,
Wenlei
CPU Usage for Spark Local Mode
Hi,

I am currently testing my application with Spark in local mode, with the master set to local[4]. One thing I notice is that when a groupBy/reduceBy operation is involved, the CPU usage can sometimes be around 600% to 800%. I am wondering if this is expected? (As only 4 worker threads are assigned, together with the driver thread, shouldn't it be at most 500%?)

Best,
Wenlei