Re: Re: Low throughput and effect of GC in Spark SQL GROUP BY

2015-05-21 Thread zhangxiongfei
Hi Pramod, is your data compressed? I encountered a similar problem; however, after turning codegen on, the GC time was still very long. The input for my map task is an LZO file of about 100 MB. My query is: select ip, count(*) as c from stage_bitauto_adclick_d group by ip sort by c limit 100
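
For context, a minimal sketch of how the codegen flag was toggled in a 1.3/1.4-era Spark shell before running the aggregate; the table name is the one from the post, everything else is illustrative:

  // Enable Spark SQL code generation (an opt-in flag before Spark 1.5).
  sqlContext.setConf("spark.sql.codegen", "true")

  // The GROUP BY query from the post.
  val top = sqlContext.sql(
    "SELECT ip, count(*) AS c FROM stage_bitauto_adclick_d GROUP BY ip SORT BY c LIMIT 100")
  top.collect().foreach(println)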

Hive cannot get the schema of an external table created by the Spark SQL API createExternalTable

2015-05-07 Thread zhangxiongfei
Hi, I was trying to create an external table named adclicktable via the API def createExternalTable(tableName: String, path: String). I can then get the schema of this table successfully, as shown below, and the table can be queried normally. The data files are all Parquet files. sqlContext.sql(describe
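
A minimal sketch of that call, assuming a HiveContext in the Spark shell; the path argument is illustrative, not the one from the original post:

  // Register an external table backed by Parquet files at the given path
  // (hypothetical location), then ask Spark SQL for its schema.
  sqlContext.createExternalTable("adclicktable", "hdfs:///user/zhangxf/adclick-parquet")
  sqlContext.sql("describe adclicktable").collect().foreach(println)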

Why do the HDFS Parquet files generated by Spark SQL have a different size from those on Tachyon?

2015-04-17 Thread zhangxiongfei
Hi, I did some tests on Parquet files with the Spark SQL DataFrame API. I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on Tachyon; each file is about 222 MB. Then I read them with the code below. val tfs
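
A sketch of the write/read round trip described above, assuming the 1.3-era DataFrame API; the source table and the Tachyon URI are illustrative, and the codec setting simply mirrors the gzip compression mentioned in the post:

  // Write side (sketch): use the gzip codec for the generated Parquet files.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
  val src = sqlContext.table("stage_bitauto_adclick_d")   // hypothetical source table
  src.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick")

  // Read side: load the files back from Tachyon (illustrative path) and count rows.
  val tfs = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick")
  println(tfs.count())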

Re: Re: Spark SQL 1.3.1 saveAsParquetFile outputs Tachyon files with a different block size

2015-04-14 Thread zhangxiongfei
zhangxiongfei wrote: Hi experts, I ran the code below in the Spark shell to access Parquet files in Tachyon. 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon: val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m"); 2. Second
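
A sketch of those quoted steps, assuming the Spark 1.3.1 shell; the load path is the one from the post, while the output path for the write-back is illustrative because the original message is truncated:

  // Step 1 (from the post): load the Parquet files stored in Tachyon.
  val ta3 = sqlContext.parquetFile(
    "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")

  // Step 2 (sketch): save the DataFrame back to Tachyon as Parquet;
  // the output path here is hypothetical.
  ta3.saveAsParquetFile(
    "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-resaved")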