Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-28 Thread Raju Bairishetti
Try setting the number of partitions to (number of executors * number of
cores) while writing to the destination location.

You should be very careful while setting the number of partitions, as an
incorrect value may lead to an expensive shuffle.
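
The advice above can be sketched as follows; the executor and core counts
here are hypothetical placeholders — substitute the values you actually pass
to spark-submit via --num-executors and --executor-cores:

```scala
// Hypothetical cluster sizing; use your real spark-submit settings.
val numExecutors = 10
val coresPerExecutor = 4
val numPartitions = numExecutors * coresPerExecutor // 40 write tasks

// repartition performs a full shuffle but spreads rows evenly across tasks;
// coalesce only merges existing partitions (no shuffle), so it can leave
// skewed partitions untouched.
// results_dataframe.repartition(numPartitions).write.save(outputPath)
```

One write task per core keeps every core busy during the save, instead of
queuing many tiny tasks or leaving cores idle.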

On Fri, Dec 16, 2016 at 12:56 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> I am trying to save the files as Parquet.
>
> On Thu, Dec 15, 2016 at 10:41 PM, Felix Cheung 
> wrote:
>
>> What is the format?
>>
>>
>> --
>> *From:* KhajaAsmath Mohammed 
>> *Sent:* Thursday, December 15, 2016 7:54:27 PM
>> *To:* user @spark
>> *Subject:* Spark Dataframe: Save to hdfs is taking long time
>>
>> Hi,
>>
>> I am facing an issue while saving the dataframe back to HDFS. It's taking a
>> long time to run.
>>
>> val results_dataframe = sqlContext.sql("select gt.*, ct.* from
>> PredictTempTable pt, ClusterTempTable ct, GamificationTempTable gt where
>> gt.vin = pt.vin and pt.cluster = ct.cluster")
>>   .coalesce(numPartitions) // coalesce returns a new DataFrame; keep the result
>> results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)
>>
>> results_dataframe.write.mode(saveMode).format(format)
>>   .option(Codec, compressCodec) // e.g. "org.apache.hadoop.io.compress.SnappyCodec"
>>   .save(outputPath)
>>
>> It is taking a long time, and the total number of records in this dataframe
>> is 4903764.
>>
>> I even increased the number of partitions from 10 to 20, still no luck. Can
>> anyone help me resolve this performance issue?
>>
>> Thanks,
>>
>> Asmath
>>
>>
>


-- 

--
Thanks,
Raju Bairishetti,
www.lazada.com


Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread KhajaAsmath Mohammed
I am trying to save the files as Parquet.

On Thu, Dec 15, 2016 at 10:41 PM, Felix Cheung 
wrote:

> What is the format?
>
>
> --
> *From:* KhajaAsmath Mohammed 
> *Sent:* Thursday, December 15, 2016 7:54:27 PM
> *To:* user @spark
> *Subject:* Spark Dataframe: Save to hdfs is taking long time
>
> Hi,
>
> I am facing an issue while saving the dataframe back to HDFS. It's taking a
> long time to run.
>
> val results_dataframe = sqlContext.sql("select gt.*, ct.* from
> PredictTempTable pt, ClusterTempTable ct, GamificationTempTable gt where
> gt.vin = pt.vin and pt.cluster = ct.cluster")
>   .coalesce(numPartitions) // coalesce returns a new DataFrame; keep the result
> results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)
>
> results_dataframe.write.mode(saveMode).format(format)
>   .option(Codec, compressCodec) // e.g. "org.apache.hadoop.io.compress.SnappyCodec"
>   .save(outputPath)
>
> It is taking a long time, and the total number of records in this dataframe
> is 4903764.
>
> I even increased the number of partitions from 10 to 20, still no luck. Can
> anyone help me resolve this performance issue?
>
> Thanks,
>
> Asmath
>
>


Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread Felix Cheung
What is the format?



From: KhajaAsmath Mohammed 
Sent: Thursday, December 15, 2016 7:54:27 PM
To: user @spark
Subject: Spark Dataframe: Save to hdfs is taking long time

Hi,

I am facing an issue while saving the dataframe back to HDFS. It's taking a
long time to run.


val results_dataframe = sqlContext.sql("select gt.*, ct.* from
PredictTempTable pt, ClusterTempTable ct, GamificationTempTable gt where
gt.vin = pt.vin and pt.cluster = ct.cluster")
  .coalesce(numPartitions) // coalesce returns a new DataFrame; keep the result
results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)

results_dataframe.write.mode(saveMode).format(format)
  .option(Codec, compressCodec) // e.g. "org.apache.hadoop.io.compress.SnappyCodec"
  .save(outputPath)

It is taking a long time, and the total number of records in this dataframe
is 4903764.

I even increased the number of partitions from 10 to 20, still no luck. Can
anyone help me resolve this performance issue?

Thanks,

Asmath


Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread KhajaAsmath Mohammed
Hi,

I am facing an issue while saving the dataframe back to HDFS. It's taking a
long time to run.

val results_dataframe = sqlContext.sql("select gt.*, ct.* from
PredictTempTable pt, ClusterTempTable ct, GamificationTempTable gt where
gt.vin = pt.vin and pt.cluster = ct.cluster")
  .coalesce(numPartitions) // coalesce returns a new DataFrame; keep the result
results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)

results_dataframe.write.mode(saveMode).format(format)
  .option(Codec, compressCodec) // e.g. "org.apache.hadoop.io.compress.SnappyCodec"
  .save(outputPath)
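
Since later in the thread Asmath says the files are saved as Parquet, a more
direct variant worth trying is the dedicated Parquet writer with its
compression option — a sketch only, assuming Spark 2.x (on 1.x, set the
spark.sql.parquet.compression.codec conf instead) and that numPartitions,
saveMode, and outputPath are defined as above:

```scala
// Sketch: the "compression" writer option selects Snappy-compressed Parquet
// directly, without naming a Hadoop codec class.
val compression = "snappy"

results_dataframe
  .repartition(numPartitions) // spreads the ~4.9M rows evenly across write tasks
  .write
  .mode(saveMode)
  .option("compression", compression)
  .parquet(outputPath)
```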

It is taking a long time, and the total number of records in this
dataframe is 4903764.

I even increased the number of partitions from 10 to 20, still no luck.
Can anyone help me resolve this performance issue?

Thanks,

Asmath


Re: save to HDFS

2014-07-24 Thread lmk
Thanks Akhil.
I was able to view the files. I had been trying to list them with the regular
ls, and since it did not show anything I was concerned.
Thanks for pointing me in the right direction.

Regards,
lmk



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-to-HDFS-tp10578p10583.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: save to HDFS

2014-07-24 Thread Akhil Das
This piece of code

saveAsHadoopFile[TextOutputFormat[NullWritable,Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")

saves the RDD into HDFS, and yes, you can physically see the files using the
hadoop command (hadoop fs -ls /root/test-app/test1 - yes, you need to log in
to the cluster). In case you are not able to execute the command (e.g. hadoop
command not found), you can run $HADOOP_HOME/bin/hadoop fs -ls
/root/test-app/test1 instead.
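
The same check can also be done programmatically via the Hadoop FileSystem
API — a sketch, where "masteripaddress" is the placeholder host from the
thread and the Hadoop client jars are assumed to be on the classpath:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Substitute your real NameNode host for the thread's placeholder.
val fs = FileSystem.get(URI.create("hdfs://masteripaddress:9000"), new Configuration())
val statuses = fs.listStatus(new Path("/root/test-app/test1/"))
statuses.foreach(s => println(s.getPath))

// A _SUCCESS marker file means the Hadoop output committer finished cleanly.
val committed = statuses.exists(_.getPath.getName == "_SUCCESS")
```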



Thanks
Best Regards


On Thu, Jul 24, 2014 at 4:34 PM, lmk 
wrote:

> Hi Akhil,
> I am sure that the RDD I saved is not empty; I have tested it using take.
> But is there no way I can see the saved output physically, as in a normal
> filesystem? Can't I view this folder, since I am already logged into the
> cluster?
> And should I run hadoop fs -ls
> hdfs://masteripaddress:9000/root/test-app/test1/
> after I log in to the cluster?
>
> Thanks,
> lmk
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/save-to-HDFS-tp10578p10581.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: save to HDFS

2014-07-24 Thread lmk
Hi Akhil,
I am sure that the RDD I saved is not empty; I have tested it using take.
But is there no way I can see the saved output physically, as in a normal
filesystem? Can't I view this folder, since I am already logged into the
cluster?
And should I run hadoop fs -ls
hdfs://masteripaddress:9000/root/test-app/test1/
after I log in to the cluster?

Thanks,
lmk



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-to-HDFS-tp10578p10581.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: save to HDFS

2014-07-24 Thread Akhil Das
Are you sure the RDD that you were saving isn't empty?

Are you seeing a _SUCCESS file in this location?
hdfs://masteripaddress:9000/root/test-app/test1/
(Do hadoop fs -ls hdfs://masteripaddress:9000/root/test-app/test1/)


Thanks
Best Regards


On Thu, Jul 24, 2014 at 4:24 PM, lmk 
wrote:

> Hi,
> I have a Scala application which I have launched on a Spark cluster. I
> have the following statement trying to save to a folder on the master:
> saveAsHadoopFile[TextOutputFormat[NullWritable,
> Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")
>
> The application executes successfully and the log also says that the save
> is complete, but I am not able to find the saved file anywhere. Is there
> a way I can access this file?
>
> Please advise.
>
> Regards,
> lmk
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/save-to-HDFS-tp10578.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


save to HDFS

2014-07-24 Thread lmk
Hi,
I have a Scala application which I have launched on a Spark cluster. I
have the following statement trying to save to a folder on the master:
saveAsHadoopFile[TextOutputFormat[NullWritable,
Text]]("hdfs://masteripaddress:9000/root/test-app/test1/")
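
For reference, the complete call with the imports it needs would look roughly
like this — a sketch; `rdd` is assumed to be an RDD of (NullWritable, Text)
pairs, which is what this output format expects:

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat

// rdd: org.apache.spark.rdd.RDD[(NullWritable, Text)] -- assumed to exist
rdd.saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](
  "hdfs://masteripaddress:9000/root/test-app/test1/")
```

Note that the save produces a directory containing one part-NNNNN file per
partition plus a _SUCCESS marker, not a single file.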

The application executes successfully and the log also says that the save is
complete, but I am not able to find the saved file anywhere. Is there a
way I can access this file?

Please advise.

Regards,
lmk



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-to-HDFS-tp10578.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.