RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread Haopu Wang
Cheng,

yes, it works. I set the property in SparkConf before initializing the
SparkContext.
The property name is spark.hadoop.dfs.replication.
Thanks for the help!
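
For the record, a minimal sketch of what worked for me (the app name and
the replication value 2 are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Every "spark.hadoop.*" entry in SparkConf is copied into the Hadoop
    // Configuration (with the prefix stripped) when the SparkContext is
    // created, so dfs.replication reaches the HDFS client that writes the
    // parquet files.
    val conf = new SparkConf()
      .setAppName("ReplicationExample")          // placeholder app name
      .set("spark.hadoop.dfs.replication", "2")  // desired replication factor

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // Tables saved from this application now get replication factor 2.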



Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread ayan guha
Hi

I am a little confused here. If I am writing to HDFS, shouldn't the HDFS
replication factor automatically kick in? In other words, how is the Spark
writer different from an hdfs -put command (from the perspective of HDFS,
of course)?
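
As a sanity check, the factor the written files actually got can be read
back along these lines (sc is an existing SparkContext; the warehouse path
is only a guess):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // The HDFS client applies its client-side dfs.replication when it
    // creates a file -- the same mechanism `hdfs dfs -put` uses -- so this
    // just reads back the per-file factor to see which setting took effect.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val tableDir = new Path("/user/hive/warehouse/my_table")  // guessed path

    fs.listStatus(tableDir).filter(_.isFile).foreach { s =>
      println(s"${s.getPath.getName} -> replication ${s.getReplication}")
    }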

Best
Ayan


-- 
Best Regards,
Ayan Guha


RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-08 Thread Haopu Wang
Cheng, thanks for the response.

Yes, I was using HiveContext.setConf() to set dfs.replication.
However, I cannot change the value in the Hadoop core-site.xml because that
would affect every HDFS file.
I only want to change the replication factor of some specific files.
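
One route not covered in the thread, in case it is useful: the factor of
individual existing files can also be changed after the write with
FileSystem.setReplication (or hdfs dfs -setrep). A rough sketch, with the
path purely illustrative and sc an existing SparkContext:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Changes the replication of this one file only; the cluster default
    // and all other files are left untouched.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val file = new Path("/user/hive/warehouse/my_table/part-r-00001.parquet")  // illustrative
    fs.setReplication(file, 2.toShort)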




Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-08 Thread Cheng Lian
Then one possible workaround is to set dfs.replication in 
sc.hadoopConfiguration.
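
Something along these lines (just a sketch; the value 2 is a placeholder):

    // Mutates the Hadoop Configuration shared by every job in this
    // SparkContext, so all later writes -- from any thread -- pick it up.
    sc.hadoopConfiguration.set("dfs.replication", "2")
    // Parquet/Hive writes issued after this point use the new factor.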


However, this configuration is shared by all Spark jobs issued within 
the same application. Since different Spark jobs can be issued from 
different threads, you need to pay attention to synchronization.


Cheng




Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Cheng Lian

Were you using HiveContext.setConf()?

dfs.replication is a Hadoop configuration, but setConf() is only used
to set Spark SQL-specific configurations. You may set it in your
Hadoop core-site.xml instead.
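
For reference, the corresponding entry would look like the following (the
value 2 is only an example; the same snippet conventionally lives in
hdfs-site.xml as well):

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>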


Cheng





SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-02 Thread Haopu Wang
Hi,

I'm trying to save a Spark SQL DataFrame to a persistent Hive table using
the default parquet data source.
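
The write is roughly the following (df is an existing DataFrame and the
table name is a placeholder):

    // Persists the DataFrame as a Hive table backed by the default data
    // source (parquet); the files land on HDFS with whatever replication
    // factor the HDFS client is configured with at write time.
    df.saveAsTable("my_table")                        // Spark 1.3-style call
    // Spark 1.4+ equivalent:
    // df.write.format("parquet").saveAsTable("my_table")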

I don't know how to change the replication factor of the generated
parquet files on HDFS.

I tried to set dfs.replication on HiveContext but that didn't work.
Any suggestions are appreciated very much!

