RE: SparkSQL: How to specify replication factor on the persisted parquet files?
Cheng, yes, it works. I set the property in SparkConf before creating the SparkContext. The property name is spark.hadoop.dfs.replication.

Thanks for the help!

-Haopu
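A minimal sketch of that setup, for anyone hitting the same problem (the writer call assumes the Spark 1.4 DataFrame API; the app and table names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Any property prefixed with "spark.hadoop." in SparkConf is copied into
    // the Hadoop Configuration that Spark's HDFS client uses, so this sets
    // the replication factor for files written by this application only.
    val conf = new SparkConf()
      .setAppName("ReplicationExample")            // illustrative name
      .set("spark.hadoop.dfs.replication", "2")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    // Parquet files produced by this save are created with replication 2.
    sqlContext.table("source_table").write.saveAsTable("persisted_table")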
Re: SparkSQL: How to specify replication factor on the persisted parquet files?
Hi, I am a little confused here. If I am writing to HDFS, shouldn't the HDFS replication factor automatically kick in? In other words, how is the Spark writer different from an hdfs -put command (from the perspective of HDFS, of course)?

Best,
Ayan Guha
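For what it's worth, replication in HDFS is a per-file attribute that the client picks when it creates the file; the cluster-wide dfs.replication only supplies the default the client falls back on. That is why the writer's own configuration matters. A hypothetical sketch against the Hadoop FileSystem API (path and contents are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // The client chooses the replication factor at create time; the
    // cluster's dfs.replication is only the default.
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(
      new Path("/tmp/example.txt"),  // illustrative path
      true,                          // overwrite
      4096,                          // buffer size
      2.toShort,                     // replication factor for this file
      128L * 1024 * 1024)            // block size
    out.writeBytes("hello")
    out.close()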
RE: SparkSQL: How to specify replication factor on the persisted parquet files?
Cheng, thanks for the response.

Yes, I was using HiveContext.setConf() to set dfs.replication. However, I cannot change the value in the Hadoop core-site.xml because that will change every HDFS file. I only want to change the replication factor of some specific files.
Re: SparkSQL: How to specify replication factor on the persisted parquet files?
Then one possible workaround is to set dfs.replication in sc.hadoopConfiguration. However, this configuration is shared by all Spark jobs issued within the same application. Since different Spark jobs can be issued from different threads, you need to pay attention to synchronization.

Cheng
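A sketch of that workaround as a helper (hypothetical; assumes the Spark 1.4 DataFrame writer API, and that every thread touching dfs.replication synchronizes on the same configuration object):

    import org.apache.spark.sql.DataFrame

    // Mutates the application-wide Hadoop conf around a single write, then
    // restores the previous value. Only safe if all threads that change
    // dfs.replication lock on the same hadoopConfiguration instance.
    def saveWithReplication(df: DataFrame, table: String, replication: Short): Unit = {
      val hadoopConf = df.sqlContext.sparkContext.hadoopConfiguration
      hadoopConf.synchronized {
        val previous = hadoopConf.get("dfs.replication")
        hadoopConf.set("dfs.replication", replication.toString)
        try df.write.saveAsTable(table)
        finally {
          if (previous != null) hadoopConf.set("dfs.replication", previous)
          else hadoopConf.unset("dfs.replication")
        }
      }
    }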
Re: SparkSQL: How to specify replication factor on the persisted parquet files?
Were you using HiveContext.setConf()? dfs.replication is a Hadoop configuration property, but setConf() is only used to set Spark SQL specific configurations. You may set it in your Hadoop core-site.xml instead.

Cheng
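To illustrate the distinction (a hypothetical snippet; hiveContext stands for an existing HiveContext):

    // setConf targets Spark SQL settings, for example:
    hiveContext.setConf("spark.sql.shuffle.partitions", "200")
    // ...but Hadoop properties such as dfs.replication belong in the Hadoop
    // Configuration (core-site.xml, or sc.hadoopConfiguration at runtime):
    hiveContext.sparkContext.hadoopConfiguration.set("dfs.replication", "2")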
SparkSQL: How to specify replication factor on the persisted parquet files?
Hi,

I'm trying to save a Spark SQL DataFrame to a persistent Hive table using the default parquet data source. I don't know how to change the replication factor of the generated parquet files on HDFS. I tried to set dfs.replication on HiveContext but that didn't work.

Any suggestions are appreciated very much!