Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all, I found the answer at the following link: https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html I can successfully set the Parquet block size with spark.hadoop.parquet.block.size. The following is the sample code: # init block_size = 512 * 1024 conf = …
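A minimal runnable sketch of the approach above (set spark.hadoop.parquet.block.size before creating the session), assuming PySpark 2.x; the output path and the DataFrame contents are placeholders rather than the poster's original code:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # 512 KB Parquet block (row group) size, passed through to the Hadoop conf.
    block_size = 512 * 1024
    conf = SparkConf().set("spark.hadoop.parquet.block.size", str(block_size))

    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Placeholder data and output path, just to exercise the writer.
    df = spark.range(100000)
    df.write.mode("overwrite").parquet("/tmp/parquet_block_size_test")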

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
OK, will do. On Fri, Mar 16, 2018 at 4:41 PM Sean Owen wrote: > I think you can file a JIRA and open a PR. All of the bits that use "gpg > ... SHA512 file ..." can use shasum instead. > I would not change any existing release artifacts though. > > On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas …

Re: Changing how we compute release hashes

2018-03-16 Thread Sean Owen
I think you can file a JIRA and open a PR. All of the bits that use "gpg ... SHA512 file ..." can use shasum instead. I would not change any existing release artifacts though. On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas wrote: > I have sha512sum on my Mac via Homebrew, but yeah as long as t…

Re: Live Stream Code Review today (in like ~5 minutes)

2018-03-16 Thread Holden Karau
OK, and the recording is now being processed and will be posted at the same URL once it's done ( https://www.youtube.com/watch?v=pXzVtEUjrLc ). You can also see a walk-through with Cody merging his first PR ( https://www.youtube.com/watch?v=_SdNu7MezL4 ). Since I had a slight problem during the liv…

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
I have sha512sum on my Mac via Homebrew, but yeah as long as the format is the same I suppose it doesn’t matter if we use shasum -a or sha512sum. So shall I file a JIRA + PR for this? Or should I leave the PR to a maintainer? And are we OK with updating all the existing release hashes to use the n…

Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all, it looks like it's a Parquet-specific issue. I can successfully write with a 512k block size if I use df.write.csv() or df.write.text(). (I can successfully do the CSV write when I put hadoop-lzo-0.4.15-cdh5.13.0.jar into the jars dir.) Sample code: block_size = 512 * 1024 conf = SparkConf().s…
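The configuration keys in the truncated snippet aren't visible, so this sketch only illustrates the CSV/text writes mentioned above; the block-size keys (spark.hadoop.dfs.blocksize / dfs.block.size) and the output paths are assumptions, not the poster's code:

    from pyspark import SparkConf
    from pyspark.sql import Row, SparkSession

    block_size = 512 * 1024  # 512 KB, as in the post

    # Assumed keys: the HDFS block size, pushed through Spark's spark.hadoop.* prefix.
    conf = (SparkConf()
            .set("spark.hadoop.dfs.blocksize", str(block_size))
            .set("spark.hadoop.dfs.block.size", str(block_size)))
    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    df = spark.createDataFrame([Row(value=str(i)) for i in range(1000)])
    df.write.mode("overwrite").csv("/tmp/csv_write_test")
    # df.write.text() requires a single string column, which `value` satisfies here.
    df.write.mode("overwrite").text("/tmp/text_write_test")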

Live Stream Code Review today (in like ~5 minutes)

2018-03-16 Thread Holden Karau
I'm going to be doing another live stream code review today in ~5 minutes. You can join and watch at https://www.youtube.com/watch?v=pXzVtEUjrLc & the result will be posted as well. In this review I'll look at PRs in both the Spark project and a related project, spark-testing-base. -- Twitter: https…

Re: Changing how we compute release hashes

2018-03-16 Thread Felix Cheung
+1 there From: Sean Owen Sent: Friday, March 16, 2018 9:51:49 AM To: Felix Cheung Cc: rb...@netflix.com; Nicholas Chammas; Spark dev list Subject: Re: Changing how we compute release hashes I think the issue with that is that OS X doesn't have "sha512sum". Both i…

Re: Changing how we compute release hashes

2018-03-16 Thread Sean Owen
I think the issue with that is that OS X doesn't have "sha512sum". Both it and Linux have "shasum -a 512" though. On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung wrote: > Instead of using gpg to create the sha512 hash file we could just change > to using sha512sum? That would output the right form…

Re: Changing how we compute release hashes

2018-03-16 Thread Felix Cheung
Instead of using gpg to create the sha512 hash file, we could just change to using sha512sum? That would output the right format, which is in turn verifiable. From: Ryan Blue Sent: Friday, March 16, 2018 8:31:45 AM To: Nicholas Chammas Cc: Spark dev list Subject: …

Re: Changing how we compute release hashes

2018-03-16 Thread Ryan Blue
+1 It's possible to produce the same file with gpg, but the sha*sum utilities are a bit easier to remember the syntax for. On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > To verify that I’ve downloaded a Hadoop release correctly, I can just do > this: > > …
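Not from the thread, but for reference: a small Python illustration of producing a digest in the same "<hex digest>  <file name>" layout that shasum -a 512 and sha512sum print, which is what makes a published .sha512 file straightforward to verify. The artifact name below is a placeholder:

    import hashlib

    def sha512_of(path, chunk_size=1 << 20):
        """Stream the file and return its SHA-512 as a lowercase hex string."""
        digest = hashlib.sha512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    artifact = "spark-2.3.0-bin-hadoop2.7.tgz"  # placeholder file name
    # Same "<hex digest>  <file name>" layout that shasum -a 512 / sha512sum emit.
    print("%s  %s" % (sha512_of(artifact), artifact))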