from:"Michael Kelly"

Count for select not matching count for group by

2015-09-21 Thread Michael Kelly

Hi, I'm seeing some strange behaviour with Spark 1.5. I have a DataFrame that I built by loading and joining some Hive tables stored in S3. The DataFrame is cached in memory using df.cache. What I'm seeing is that the counts I get when I do a group by on a column are different from what

Re: Parquet writing gets progressively slower

2015-07-25 Thread Michael Kelly

and tries to merge them. The more data there is, the longer it takes. One possible workaround is to disable summary files by setting parquet.enable.summary-metadata to false in sc.hadoopConfiguration. Cheng On 7/25/15 4:15 AM, Michael Kelly wrote: Hi, We are converting some csv log
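The workaround mentioned in this reply could look roughly like the sketch below. Only the property name parquet.enable.summary-metadata and the use of sc.hadoopConfiguration come from the message; the object name, app name, and local master are placeholders assumed for illustration, and the actual write step is left out because the thread excerpt does not show it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object; the name is an assumption for this sketch.
object DisableParquetSummary {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders, not taken from the thread.
    val conf = new SparkConf().setAppName("csv-to-parquet").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Workaround from the reply: disable Parquet summary files so that
    // existing footers are not read and merged on every write.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // The CSV-to-Parquet write would go here, unchanged from the existing job.

    sc.stop()
  }
}
```

The setting has to be applied before the write is triggered, since it is read from the Hadoop configuration when the output is committed.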

Parquet writing gets progressively slower

2015-07-24 Thread Michael Kelly

Hi, We are converting some CSV log files to Parquet, but the job is getting progressively slower the more files we add to the Parquet folder. The Parquet files are being written to S3. We are using a Spark standalone cluster running on EC2, and the Spark version is 1.4.1. The parquet files are

3 matches
