>> On Wed, May 25, 2022 at 12:33 AM Ranadip Chatterjee wrote:
>>
>>> To me, it seems like the data being processed on the 2 systems is not
>>> identical. Can't think of any other reason why the single […]
> […] of data, and on Dataproc it receives 700 GB. I don't understand why
> this is happening. It could mean that the graph is different, but the job
> is exactly the same. Could it be because the minor version of Spark is
> different?
Hi Ori,

A single task for the final step can result from various scenarios, such as
an aggregate operation that produces only one value (e.g. a global count),
or a key-based aggregate where all records share a single key. There could
be other scenarios as well. However, that would be the case on both EMR and
Dataproc if […] S3 will have slightly different nuances but will behave very
similarly to HDFS in this scenario.

So, for all practical purposes, it is safe to say Spark will progress the
job to completion in nearly all practical cases.
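The single-key case is easy to see with a quick sketch of hash partitioning (plain Python standing in for Spark's HashPartitioner; the names and partition count here are illustrative, not Spark API):

```python
def partition_for(key, num_partitions):
    """Mimic a hash partitioner: records with equal keys always land in the
    same reduce partition, so one key means one busy task."""
    return hash(key) % num_partitions

# A key-based aggregate where every record carries the same key, e.g. "total".
records = [("total", v) for v in range(1000)]

# However many reduce partitions are configured, only one ever receives data.
used_partitions = {partition_for(key, 200) for key, _ in records}
print(len(used_partitions))  # 1
```

All 200 reduce tasks are scheduled, but 199 of them see an empty partition, which is why the final step appears to run as a single task on both platforms.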
Regards,
Ranadip Chatterjee
On Fri, 21 Jan 2022 at 20:40, Sean Owen wrote:
Looks like your session user does not have the required privileges on the
remote HDFS directory that holds the Hive data. Since you get the columns,
your session is able to read the metadata, so the connection to the remote
HiveServer2 is successful. You should be able to find more
troubleshooting […]
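For context, HDFS enforces POSIX-style owner/group/other permission bits on directories, which is usually what bites here: the metastore is readable but the warehouse directory is, say, `750` and owned by another user. A minimal sketch of that check in plain Python (the directory entry and user names are made up for illustration; this is not the HDFS API):

```python
def hdfs_style_access(user, user_groups, entry, want):
    """POSIX-style rwx check as HDFS applies it: pick the owner, group, or
    other permission triplet, then test the requested bit."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == entry["owner"]:
        triplet = (entry["mode"] >> 6) & 7
    elif entry["group"] in user_groups:
        triplet = (entry["mode"] >> 3) & 7
    else:
        triplet = entry["mode"] & 7
    return bool(triplet & bit)

# Hypothetical warehouse directory: owned by 'hive', group 'hadoop', mode 750.
warehouse = {"owner": "hive", "group": "hadoop", "mode": 0o750}

print(hdfs_style_access("hive", ["hadoop"], warehouse, "r"))   # True (owner)
print(hdfs_style_access("ori", ["analysts"], warehouse, "r"))  # False (other)
```

A session user falling into the "other" bucket of a `750` directory gets exactly the symptom described: metadata works, data reads fail.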
This will depend on multiple factors. Assuming we are talking about
significant volumes of data, I'd prefer Sqoop over Spark on YARN if
ingestion performance is the sole consideration (which is true in many
production use cases). Sqoop provides some potential optimisations,
especially around using […]
I know of projects that have done this but have never seen any advantage of
"using Spark to do what Sqoop does", at least in a YARN cluster. Both
frameworks have similar overheads of getting the containers allocated by
YARN and creating new JVMs to do the work. Probably Spark will have a
slight […]
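The core parallelisation trick is the same in both tools: carve the source table's key range into contiguous chunks and give each reader one chunk (Sqoop via `--split-by`, Spark via the JDBC `partitionColumn`/`numPartitions` options). A minimal sketch of that split computation, with illustrative bounds:

```python
def split_ranges(lo, hi, num_splits):
    """Divide the key range [lo, hi] into num_splits contiguous chunks,
    one per parallel reader; the last chunk absorbs any remainder."""
    width = (hi - lo + 1) // num_splits
    bounds = [lo + i * width for i in range(num_splits)] + [hi + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_splits)]

# Each tuple becomes one reader's predicate, e.g. WHERE id BETWEEN 1 AND 25.
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Since both frameworks end up issuing the same range queries, the remaining performance difference comes down to container startup and per-task overheads rather than the read strategy itself.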
T3l,
Did Sean Owen's suggestion help? If not, can you please share the behaviour?
Cheers.
On 20 Oct 2015 11:02 pm, "Lan Jiang" wrote:
> I think the data file is binary per the original post. So in this case,
> sc.binaryFiles should be used. However, I still recommend against using so
> many small files […]
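The usual mitigation for the small-files problem is to group many tiny inputs into a smaller number of reasonably sized batches before (or while) reading them, so each task does meaningful work. A sketch of a simple greedy grouping by size (file names and the 100-byte target are made up for illustration; this is plain Python, not a Spark API):

```python
def pack_files(sizes, target):
    """Greedily group files into batches of at most ~target total bytes,
    largest first, so each batch approximates one well-sized task."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Four small files packed into three batches under a 100-byte target.
print(pack_files({"a": 60, "b": 50, "c": 40, "d": 30}, 100))
```

With sc.binaryFiles the same idea shows up as tuning minPartitions downward, or concatenating the small files into archives up front, so the driver isn't tracking one partition per tiny file.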
[…] a new SparkContext and a HiveContext (more of a shot in the dark): try
creating a new set of contexts after the ALTER TABLE, to reload the state as
of that point in time.
None of these have worked so far.
Any ideas, suggestions or experiences along similar lines?
--
Regards,
Ranadip Chatterjee