Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-30 Thread Ranadip Chatterjee
> >> +cloud-dataproc-discuss >> >> On Wed, May 25, 2022 at 12:33 AM Ranadip Chatterjee >> wrote: >>> To me, it seems like the data being processed on the 2 systems is not >>> identical. Can't think of any other reason why the single task stage w

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-25 Thread Ranadip Chatterjee
proc it > receives 700gb. I don't understand why it's happening. It can mean that the > graph is different. But the job is exactly the same. Could it be because > the minor version of Spark is different? > > On Wed, May 25, 2022 at 12:27 AM Ranadip Chatterjee > wrote: > >> Hi
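
A quick way to test whether the two clusters are really running the same graph is to compare the physical plans. A minimal spark-shell sketch, assuming a DataFrame job; paths and column names are placeholders, not taken from the thread:

    // Run the same snippet on EMR and on Dataproc and diff the printed plans.
    val df = spark.read.parquet("s3://bucket/input")   // gs://bucket/input on Dataproc
    val result = df.groupBy("key").count()
    result.explain(true)   // prints parsed, analyzed, optimized and physical plans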

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-24 Thread Ranadip Chatterjee
Hi Ori, A single task for the final step can result from various scenarios, like an aggregate operation that results in only 1 value (e.g. count) or a key-based aggregate with only 1 key, for example. There could be other scenarios as well. However, that would be the case in both EMR and Dataproc if
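
As an illustration of the single-key case (not code from the thread; the key and sizes are made up), a spark-shell sketch where every record hashes to the same reducer, so one task ends up doing effectively all the work in the final stage:

    // Every record carries the same key, so after the shuffle a single
    // partition (and hence a single task) holds all the data.
    val rdd = sc.parallelize(1 to 1000000).map(x => ("only-key", x.toLong))
    val agg = rdd.reduceByKey(_ + _)
    println(agg.getNumPartitions)   // many partitions exist, but only one is non-empty
    agg.mapPartitions(it => Iterator(it.size)).collect().foreach(println)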

Re: What happens when a partition that holds data under a task fails

2022-01-23 Thread Ranadip Chatterjee
e slightly different nuances but will behave very similarly to HDFS in this scenario. So, for all practical purposes, it is safe to say Spark will progress the job to completion in nearly all cases. Regards, Ranadip Chatterjee On Fri, 21 Jan 2022 at 20:40, Sean Owen wrote: > Probably,
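
For reference, a hedged sketch of the Spark-side knobs involved (property names are real, values illustrative): a failed task is retried, lost partitions are recomputed from lineage, and checkpointing to HDFS is an optional way to shorten that lineage:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("retry-illustration")
      .set("spark.task.maxFailures", "4")           // default: a task is retried up to 4 times
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // optional lineage truncation point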

Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Ranadip Chatterjee
Looks like your session user does not have the required privileges on the remote hdfs directory that is holding the hive data. Since you get the columns, your session is able to read the metadata, so connection to the remote hiveserver2 is successful. You should be able to find more
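
One way to confirm the privilege theory is to stat the table's warehouse directory as the same session user; a sketch using the Hadoop FileSystem API from spark-shell (the path is a placeholder, and the check has to run against the remote cluster's HDFS):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Placeholder path; substitute the table's actual location from the metastore.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val st = fs.getFileStatus(new Path("/user/hive/warehouse/mydb.db/mytable"))
    println(s"owner=${st.getOwner} group=${st.getGroup} permission=${st.getPermission}")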

Re: Sqoop vs spark jdbc

2016-08-24 Thread Ranadip Chatterjee
This will depend on multiple factors. Assuming we are talking significant volumes of data, I'd prefer sqoop over spark on yarn if ingestion performance is the sole consideration (which is true in many production use cases). Sqoop provides some potential optimisations, especially around using
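
For comparison, a sketch of what a parallel JDBC ingest looks like on the Spark side (connection details, bounds and paths are made up); without partitionColumn the read collapses to a single task, which is often why Sqoop with --num-mappers looks faster out of the box:

    // Illustrative connection details; the four partitioning options below are
    // what spread the read across numPartitions concurrent tasks.
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/sales")
      .option("dbtable", "orders")
      .option("user", "etl")
      .option("password", "secret")
      .option("partitionColumn", "order_id")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()
    orders.write.parquet("hdfs:///landing/orders")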

Re: Sqoop on Spark

2016-04-06 Thread Ranadip Chatterjee
I know of projects that have done this but have never seen any advantage of "using spark to do what sqoop does" - at least in a yarn cluster. Both frameworks will have similar overheads of getting the containers allocated by yarn and creating new jvms to do the work. Probably spark will have a

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-21 Thread Ranadip Chatterjee
T3l, Did Sean Owen's suggestion help? If not, can you please share the behaviour? Cheers. On 20 Oct 2015 11:02 pm, "Lan Jiang" wrote: > I think the data file is binary per the original post. So in this case, > sc.binaryFiles should be used. However, I still recommend against
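
For reference, a sketch of that suggestion (paths and numbers are illustrative): sc.binaryFiles takes a minPartitions hint, and coalesce can then cut an excessive partition count down without a shuffle:

    // binaryFiles returns (path, PortableDataStream) pairs; minPartitions is only a hint.
    val files = sc.binaryFiles("hdfs:///data/blobs", minPartitions = 64)
    val fewer = files.coalesce(200)   // e.g. collapse tens of thousands of splits into 200
    println(fewer.getNumPartitions)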

Hivecontext going out-of-sync issue

2015-06-18 Thread Ranadip Chatterjee
SparkContext and a HiveContext - more of a shot in the dark - try and create a new set of contexts after the alter table to try and reload the state at that point in time. None of these have worked so far. Any ideas, suggestions or experiences along similar lines? -- Regards, Ranadip
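
For context, a sketch in Spark 1.x terms of recreating the contexts after the DDL, plus the refreshTable call that is usually tried alongside it (table name and DDL are placeholders, not taken from the thread):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    hc.sql("ALTER TABLE mytable ADD COLUMNS (new_col STRING)")   // placeholder DDL
    hc.refreshTable("mytable")      // invalidate cached metadata for the table
    val hc2 = new HiveContext(sc)   // or start over with a fresh HiveContext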