Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-30 Thread Ranadip Chatterjee
Gzip files are not splittable. Hence using very large (i.e. non-partitioned)
gzip files leads to contention when reading them, as readers cannot scale
beyond the number of gzip files.

Better to use a splittable compression format instead to allow frameworks
to scale up. Or manually manage scaling by using partitions, as you are
doing now.
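
For example, a one-off conversion of the unpartitioned gzip dump into a
splittable, columnar layout restores read parallelism for every downstream
job. A rough sketch (paths, input format and the hourly column are
assumptions, not your actual pipeline):

```
// Sketch only: paths, input format and the "hour" column are assumptions.
val raw = spark.read.json("gs://bucket/kafka-connect-dump/*.json.gz")
// gzip is detected from the file extension, but each .gz file is still a single read task.

raw.repartition(400)                       // spread the data across the cluster once
  .write
  .mode("overwrite")
  .partitionBy("hour")                     // assumes an hourly column exists in the data
  .parquet("gs://bucket/kafka-connect-parquet/")   // Parquet (snappy by default) is splittable
```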

On Mon, 30 May 2022, 08:54 Ori Popowski,  wrote:

> Thanks.
>
> Eventually the problem was solved. I am still not 100% sure what caused it
> but when I said the input was identical I simplified a bit because it was
> not (sorry for misleading, I thought this information would just be noise).
> Explanation: the input to the EMR job was gzips created by Firehose and
> partitioned hourly. The input to the Dataproc job is gzips created by Kafka
> Connect and is not partitioned hourly. Otherwise the *content* itself is
> identical.
>
> When we started partitioning the files hourly the problem went away
> completely.
>
> I am still not sure what's going on exactly. If someone has some insight
> it's welcome.
>
> Thanks!
>
> On Fri, May 27, 2022 at 9:45 PM Aniket Mokashi 
> wrote:
>
>> +cloud-dataproc-discuss
>>
>> On Wed, May 25, 2022 at 12:33 AM Ranadip Chatterjee 
>> wrote:
>>
>>> To me, it seems like the data being processed on the 2 systems is not
>>> identical. Can't think of any other reason why the single task stage will
>>> get a different number of input records in the 2 cases. 700gb of input to a
>>> single task is not good, and seems to be the bottleneck.
>>>
>>> On Wed, 25 May 2022, 06:32 Ori Popowski,  wrote:
>>>
>>>> Hi,
>>>>
>>>> Both jobs use spark.dynamicAllocation.enabled so there's no need to
>>>> change the number of executors. There are 702 executors in the Dataproc
>>>> cluster so this is not the problem.
>>>> About number of partitions - this I didn't change and it's still 400.
>>>> While writing this now, I am realising that I have more partitions than
>>>> executors, but the same situation applies to EMR.
>>>>
>>>> I am observing 1 task in the final stage also on EMR. The difference is
>>>> that on EMR that task receives 50K volume of data and on Dataproc it
>>>> receives 700gb. I don't understand why it's happening. It can mean that the
>>>> graph is different. But the job is exactly the same. Could it be because
>>>> the minor version of Spark is different?
>>>>
>>>> On Wed, May 25, 2022 at 12:27 AM Ranadip Chatterjee <
>>>> ranadi...@gmail.com> wrote:
>>>>
>>>>> Hi Ori,
>>>>>
>>>>> A single task for the final step can result from various scenarios
>>>>> like an aggregate operation that results in only 1 value (e.g count) or a
>>>>> key based aggregate with only 1 key for example. There could be other
>>>>> scenarios as well. However, that would be the case in both EMR and 
>>>>> Dataproc
>>>>> if the same code is run on the same data in both cases.
>>>>>
>>>>> On a separate note, since you have now changed the size and number of
>>>>> nodes, you may need to re-optimize the number and size of executors for 
>>>>> the
>>>>> job and perhaps the number of partitions as well to optimally use the
>>>>> cluster resources.
>>>>>
>>>>> Regards,
>>>>> Ranadip
>>>>>
>>>>> On Tue, 24 May 2022, 10:45 Ori Popowski,  wrote:
>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark
>>>>>> 2.4.8. I am creating a cluster with the exact same configuration, where 
>>>>>> the
>>>>>> only difference is that the original cluster uses 78 workers with 96 CPUs
>>>>>> and 768GiB memory each, and in the new cluster I am using 117 machines 
>>>>>> with
>>>>>> 64 CPUs and 512GiB each, to achieve the same amount of resources in the
>>>>>> cluster.
>>>>>>
>>>>>> The job is run with the same configuration (num of partitions,
>>>>>> parallelism, etc.) and reads the same data. However, something strange
>>>>>> happens and the job takes 20 hours. What I observed is that there is a
>>>>>> stage where the driver instantiates a single task, and this task never
>>>>>> starts because the shuffle of moving all the data to it takes forever.
>>>>>>
>>>>>> I also compared the runtime configuration and found some minor
>>>>>> differences (due to Dataproc being different from EMR) but I haven't 
>>>>>> found
>>>>>> any substantial difference.
>>>>>>
>>>>>> In other stages the cluster utilizes all the partitions (400), and
>>>>>> it's not clear to me why it decides to invoke a single task.
>>>>>>
>>>>>> Can anyone provide an insight as to why such a thing would happen?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>
>>>>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
>>
>


Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-25 Thread Ranadip Chatterjee
To me, it seems like the data being processed on the 2 systems is not
identical. Can't think of any other reason why the single task stage will
get a different number of input records in the 2 cases. 700gb of input to a
single task is not good, and seems to be the bottleneck.
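
One quick way to confirm that on both clusters is to count the records per
partition feeding that stage. A diagnostic sketch (inputRdd is a placeholder
for whatever RDD feeds the single-task stage):

```
// Diagnostic sketch: "inputRdd" is a placeholder for the RDD feeding the suspect stage.
val perPartition = inputRdd
  .mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
  .collect()
  .sortBy(_._2)

perPartition.takeRight(10).foreach { case (i, n) =>
  println(s"partition $i -> $n records")
}
```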

On Wed, 25 May 2022, 06:32 Ori Popowski,  wrote:

> Hi,
>
> Both jobs use spark.dynamicAllocation.enabled so there's no need to
> change the number of executors. There are 702 executors in the Dataproc
> cluster so this is not the problem.
> About number of partitions - this I didn't change and it's still 400.
> While writing this now, I am realising that I have more partitions than
> executors, but the same situation applies to EMR.
>
> I am observing 1 task in the final stage also on EMR. The difference is
> that on EMR that task receives 50K volume of data and on Dataproc it
> receives 700gb. I don't understand why it's happening. It can mean that the
> graph is different. But the job is exactly the same. Could it be because
> the minor version of Spark is different?
>
> On Wed, May 25, 2022 at 12:27 AM Ranadip Chatterjee 
> wrote:
>
>> Hi Ori,
>>
>> A single task for the final step can result from various scenarios like
>> an aggregate operation that results in only 1 value (e.g count) or a key
>> based aggregate with only 1 key for example. There could be other scenarios
>> as well. However, that would be the case in both EMR and Dataproc if the
>> same code is run on the same data in both cases.
>>
>> On a separate note, since you have now changed the size and number of
>> nodes, you may need to re-optimize the number and size of executors for the
>> job and perhaps the number of partitions as well to optimally use the
>> cluster resources.
>>
>> Regards,
>> Ranadip
>>
>> On Tue, 24 May 2022, 10:45 Ori Popowski,  wrote:
>>
>>> Hello
>>>
>>> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark 2.4.8.
>>> I am creating a cluster with the exact same configuration, where the only
>>> difference is that the original cluster uses 78 workers with 96 CPUs and
>>> 768GiB memory each, and in the new cluster I am using 117 machines with 64
>>> CPUs and 512GiB each, to achieve the same amount of resources in the
>>> cluster.
>>>
>>> The job is run with the same configuration (num of partitions,
>>> parallelism, etc.) and reads the same data. However, something strange
>>> happens and the job takes 20 hours. What I observed is that there is a
>>> stage where the driver instantiates a single task, and this task never
>>> starts because the shuffle of moving all the data to it takes forever.
>>>
>>> I also compared the runtime configuration and found some minor
>>> differences (due to Dataproc being different from EMR) but I haven't found
>>> any substantial difference.
>>>
>>> In other stages the cluster utilizes all the partitions (400), and it's
>>> not clear to me why it decides to invoke a single task.
>>>
>>> Can anyone provide an insight as to why such a thing would happen?
>>>
>>> Thanks
>>>
>>>
>>
>>


Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-24 Thread Ranadip Chatterjee
Hi Ori,

A single task for the final step can result from various scenarios, such as an
aggregate operation that results in only one value (e.g. count) or a key-based
aggregate with only one key. There could be other scenarios as
well. However, that would be the case in both EMR and Dataproc if the same
code is run on the same data in both cases.
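
A couple of minimal sketches of such shapes (df and the column names are
hypothetical); both funnel every row through a single reducer task in the
final stage:

```
import org.apache.spark.sql.functions._

// Global aggregate: one output row, so the final stage runs a single task.
val total = df.agg(count(lit(1)))

// Single-key (or heavily skewed) aggregate: every row shuffles to one reducer task.
val oneKey = df.withColumn("k", lit(1))
  .groupBy("k")
  .agg(collect_list("payload"))   // "payload" is a hypothetical column
```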

On a separate note, since you have now changed the size and number of
nodes, you may need to re-optimize the number and size of executors for the
job and perhaps the number of partitions as well to optimally use the
cluster resources.

Regards,
Ranadip

On Tue, 24 May 2022, 10:45 Ori Popowski,  wrote:

> Hello
>
> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark 2.4.8. I
> am creating a cluster with the exact same configuration, where the only
> difference is that the original cluster uses 78 workers with 96 CPUs and
> 768GiB memory each, and in the new cluster I am using 117 machines with 64
> CPUs and 512GiB each, to achieve the same amount of resources in the
> cluster.
>
> The job is run with the same configuration (num of partitions,
> parallelism, etc.) and reads the same data. However, something strange
> happens and the job takes 20 hours. What I observed is that there is a
> stage where the driver instantiates a single task, and this task never
> starts because the shuffle of moving all the data to it takes forever.
>
> I also compared the runtime configuration and found some minor differences
> (due to Dataproc being different from EMR) but I haven't found any
> substantial difference.
>
> In other stages the cluster utilizes all the partitions (400), and it's
> not clear to me why it decides to invoke a single task.
>
> Can anyone provide an insight as to why such a thing would happen?
>
> Thanks
>
>


Re: What happens when a partition that holds data under a task fails

2022-01-23 Thread Ranadip Chatterjee
Interesting question! I think this goes back to the roots of Spark. You ask
"But suppose if I am reading a file that is distributed across nodes in
partitions. So, what will happen if a partition fails that holds some
data?". Assuming you mean the distributed file system that holds the file
suffers a failure in the node that hosts the said partition - here's
something that might explain this better:

Spark does not come with its own persistent distributed file storage
system. So Spark relies on the underlying file storage system to provide
resilience over input file failure. Commonly, an open source stack will see
HDFS (Hadoop Distributed File System) being the persistent distributed file
system that Spark reads from and writes back to. In that case, your
specific case will likely mean a failure of the datanode of HDFS that was
hosting the partition of data that failed.

Now, if the failure happens after Spark has completely read that partition
and is in the middle of the job, Spark will progress with the job
unhindered because it does not need to go back and re-read the data.
Remember, Spark will save (as RDDs) all intermediate states of the data
between stages and Spark stages continue to save the snippet of DAG from
one RDD to the next. So, Spark can recover from its own node failures in
the intermediate stages by simply rerunning the DAGs from the last saved
RDD to recover to the next stage. The only case where Spark will need to
reach out to HDFS is if the very first Spark stage encounters a failure
before it has created the RDD for that partition.

In that specific edge case, Spark will reach out to HDFS to request the
failed HDFS block. At that point, if HDFS detects that the datanode hosting
that block is not responding, it will transparently redirect Spark to
another replica of the same block. So, the job will progress unhindered in
this case (perhaps a tad slower as the read may no longer be node-local).
Only in the extreme scenarios where HDFS has a catastrophic failure, all
the replicas are offline at that exact moment or the file was saved with
only 1 replica - will the Spark job fail as there is no way for it to
recover the said partitions. (In my rather long time working with Spark, I
have never come across this scenario yet).

Other distributed file systems behave similarly - e.g. Google Cloud Storage
or Amazon S3 will have slightly different nuances but will behave very
similarly to HDFS in this scenario.

So, for all practical purposes, it is safe to say Spark will progress the
job to completion in nearly all practical cases.
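
As an aside, if you ever want that intermediate state to be explicitly durable
rather than recomputed from lineage, checkpointing is the mechanism. A small
sketch, with an assumed checkpoint directory:

```
// Sketch: a checkpoint writes the RDD to HDFS (with HDFS replication), so a later
// failure recovers from the checkpoint instead of replaying the lineage from the source.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")    // assumed directory

val parsed = sc.textFile("hdfs:///data/input")          // HDFS serves any live replica of each block
  .map(_.split('\t'))                                   // some per-record work

parsed.checkpoint()   // must be called before the first action on this RDD
parsed.count()        // the action that materialises (and checkpoints) the RDD
```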

Regards,
Ranadip Chatterjee


On Fri, 21 Jan 2022 at 20:40, Sean Owen  wrote:

> Probably, because Spark prefers locality, but not necessarily.
>
> On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <
> kalgaonkarsiddh...@gmail.com> wrote:
>
>> Thank you so much for this information, Sean. One more question: when
>> it wants to re-run the failed partition, where does it run? On the
>> same node or some other node?
>>
>>
>> On Fri, 21 Jan 2022, 23:41 Sean Owen,  wrote:
>>
>>> The Spark program already knows the partitions of the data and where
>>> they exist; that's just defined by the data layout. It doesn't care what
>>> data is inside. It knows partition 1 needs to be processed and if the task
>>> processing it fails, needs to be run again. I'm not sure where you're
>>> seeing data loss here? the data is already stored to begin with, not
>>> somehow consumed and deleted.
>>>
>>> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>
>>>> Okay, so suppose I have 10 records distributed across 5 nodes and the
>>>> partition of the first node holding 2 records failed. I understand that it
>>>> will re-process this partition but how will it come to know that XYZ
>>>> partition was holding XYZ data so that it will pick again only those
>>>> records and reprocess it? In case of failure of a partition, is there a
>>>> data loss? or is it stored somewhere?
>>>>
>>>> Maybe my question is very naive but I am trying to understand it in a
>>>> better way.
>>>>
>>>> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen  wrote:
>>>>
>>>>> In that case, the file exists in parts across machines. No, tasks
>>>>> won't re-read the whole file; no task does or can do that. Failed
>>>>> partitions are reprocessed, but as in the first pass, the same partition 
>>>>> is
>>>>> processed.
>>>>>
>>>>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <
>>>>> kalgaonkarsiddh...@gmai

Re: [Spark Core] Does spark support read from remote Hive server via JDBC

2017-06-08 Thread Ranadip Chatterjee
Looks like your session user does not have the required privileges on the
remote hdfs directory that is holding the hive data. Since you get the
columns, your session is able to read the metadata, so connection to the
remote hiveserver2 is successful. You should be able to find more
troubleshooting information in the remote hiveserver2 log file.

Try with select * limit 10 to keep it simple.
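
For reference, that simplified probe pushed through the same JDBC source might
look like the sketch below (the port is an assumption, and the subquery alias
is needed by the JDBC data source):

```
// Sketch only: connection details are assumed, mirroring the code quoted below.
val sample = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://remote.hive.server:10000/work_base")     // port is an assumption
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "(select * from some_table_with_data limit 10) t") // pushes the limit down
  .load()

sample.show()
```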

Ranadip


On 8 Jun 2017 6:31 pm, "Даша Ковальчук"  wrote:

The result is count = 0.

2017-06-08 19:42 GMT+03:00 ayan guha :

> What is the result of test.count()?
>
> On Fri, 9 Jun 2017 at 1:41 am, Даша Ковальчук 
> wrote:
>
>> Thanks for your reply!
>> Yes, I tried this solution and had the same result. Maybe you have
>> another solution or maybe I can execute query in another way on remote
>> cluster?
>>
>> 2017-06-08 18:30 GMT+03:00 Даша Ковальчук :
>>
>>> Thanks for your reply!
>>> Yes, I tried this solution and had the same result. Maybe you have
>>> another solution or maybe I can execute query in another way on remote
>>> cluster?
>>>
>>
>>> 2017-06-08 18:10 GMT+03:00 Vadim Semenov :
>>>
 Have you tried running a query? something like:

 ```
 test.select("*").limit(10).show()
 ```

 On Thu, Jun 8, 2017 at 4:16 AM, Даша Ковальчук <
 dashakovalchu...@gmail.com> wrote:

> Hi guys,
>
> I need to execute hive queries on a remote hive server from spark, but
> for some reason I receive only column names (without data).
> Data is available in the table; I checked it via HUE and a java jdbc
> connection.
>
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:
> 1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
>
>
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
>
> But this problem available on Hive with later versions, too.
> I didn't find anything in the mailing list answers or on StackOverflow.
> Could you please help me with this issue, or help me find the correct
> way to query remote hive from spark?
>
> Thanks in advance!
>


>>> --
> Best Regards,
> Ayan Guha
>


Re: Sqoop vs spark jdbc

2016-08-24 Thread Ranadip Chatterjee
This will depend on multiple factors. Assuming we are talking significant
volumes of data, I'd prefer sqoop compared to spark on yarn, if ingestion
performance is the sole consideration (which is true in many production use
cases). Sqoop provides some potential optimisations, especially around using
native database batch extraction tools that spark cannot take advantage of.
The performance inefficiency of using MR (actually map-only) is
insignificant over a large corpus of data. Further, in a shared cluster, if
the data volume is skewed for the given partition key, spark, without
dynamic container allocation, can be significantly inefficient from a cluster
resource-usage perspective. With dynamic allocation enabled, it is less so,
but sqoop still has a slight edge due to the time Spark holds on to the
resources before giving them up.

If ingestion is part of a more complex DAG that relies on Spark cache (rdd
/ dataframe or dataset), then using Spark jdbc can have a significant
advantage in being able to cache the data without persisting into hdfs
first. But whether this will convert into an overall significantly better
performance of the DAG or cluster will depend on the DAG stages and their
performance. In general, if the ingestion stage is the significant
bottleneck in the DAG, then the advantage will be significant.
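
For reference, the Spark JDBC knobs that control that ingest parallelism look
like the sketch below (table, column and bounds are made-up placeholders):

```
// Sketch: partitioned JDBC read; one task (and one range query) per partition.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // placeholder connection
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "sales.orders")                       // placeholder table
  .option("user", "etl_user")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "order_id")   // numeric/date column, ideally evenly distributed
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "64")           // 64 concurrent connections against the source
  .load()
```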

Hope this provides a general direction to consider in your case.

On 25 Aug 2016 3:09 a.m., "Venkata Penikalapati" <
mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
> Sqoop has a lot of optimizations for fetching data; does spark jdbc also have
> those?
>
> I'm performing a few analytics using spark for which the data is residing
> in an rdbms.
>
> Please guide me with this.
>
>
> Thanks
> Venkata Karthik P
>
>


Re: Sqoop on Spark

2016-04-06 Thread Ranadip Chatterjee
I know of projects that have done this but have never seen any advantage of
"using spark to do what sqoop does" - at least in a yarn cluster. Both
frameworks will have similar overheads of getting the containers allocated
by yarn and creating new jvms to do the work. Probably spark will have a
slightly higher overhead due to creation of RDD before writing the data to
hdfs - something that the sqoop mapper need not do. (So what am I
overlooking here?)

In cases where a data pipeline is being built with the sqooped data being
the only trigger, there is a justification for using spark instead of sqoop
to short circuit the data directly into the transformation pipeline.
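
A rough sketch of that short circuit (placeholder names throughout): the JDBC
read feeds the transformations directly and lands in Hive once, with no
intermediate sqoop-style HDFS copy.

```
// Sketch only: source and target names are placeholders.
val src = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
  "dbtable"  -> "src_schema.events",
  "user"     -> "etl_user",
  "password" -> sys.env("DB_PASSWORD")
)).load()

src.filter("event_type = 'click'")          // the transformation pipeline starts immediately
  .write
  .mode("append")
  .saveAsTable("analytics.events_staged")   // lands in Hive once, already transformed
```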

Regards
Ranadip
On 6 Apr 2016 7:05 p.m., "Michael Segel"  wrote:

> I don’t think its necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
>
> I do not think you can be more resource efficient. In the end you have to
> store the data on HDFS anyway. It takes a lot of development effort to build
> something like sqoop, especially the error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>
> One of the reasons in my mind is to avoid a Map-Reduce application completely
> during ingestion, if possible. Also, I can then use a Spark standalone
> cluster to ingest, even if my hadoop cluster is heavily loaded. What do you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha  wrote:
>>
>> Hi
>>
>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using
>> index and incremental only) and add the data to existing Hive tables. Also, it
>> would be good to have an option to create the Hive tables, driven by job-
>> specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case using sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:
>>>
 Hi All

 Asking opinion: is it possible/advisable to use spark to replace what
 sqoop does? Any existing projects done along similar lines?

 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-21 Thread Ranadip Chatterjee
T3l,

Did Sean Owen's suggestion help? If not, can you please share the behaviour?
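
For reference, the compaction the quoted thread below converges on (wrap the
small binaries into a sequence file keyed by file name, then coalesce into
larger chunks) might look like this sketch (paths and the coalesce target are
assumptions):

```
// Sketch: pack many small binaries into a few large sequence files (file name -> bytes).
val compacted = sc.binaryFiles("hdfs:///path/to/small-files")
  .mapValues(_.toArray)    // (file name, file contents as a byte array)
  .coalesce(64)            // pick a count that yields a few hundred MB per partition

compacted.saveAsSequenceFile("hdfs:///path/to/compacted")
```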

Cheers.
On 20 Oct 2015 11:02 pm, "Lan Jiang"  wrote:

> I think the data file is binary per the original post. So in this case,
> sc.binaryFiles should be used. However, I still recommend against using so
> many small binary files as
>
> 1. They are not good for batch I/O
> 2. They put too much memory pressure on the namenode.
>
> Lan
>
>
> On Oct 20, 2015, at 11:20 AM, Deenar Toraskar 
> wrote:
>
> also check out wholeTextFiles
>
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)
>
> On 20 October 2015 at 15:04, Lan Jiang  wrote:
>
>> As Francois pointed out, you are encountering a classic small file
>> anti-pattern. One solution I used in the past is to wrap all these small
>> binary files into a sequence file or avro file. For example, the avro
>> schema can have two fields: filename: string and binaryname:byte[]. Thus
>> your file is splittable and will not create so many partitions.
>>
>> Lan
>>
>>
>> On Oct 20, 2015, at 8:03 AM, François Pelletier <
>> newslett...@francoispelletier.org> wrote:
>>
>> You should aggregate your files into larger chunks before doing anything
>> else. HDFS is not fit for small files; they bloat it and cause a lot of
>> performance issues. Target partition sizes of a few hundred MB, save those
>> files back to HDFS, and then delete the original ones. You can read, use
>> coalesce, and then saveAsXXX on the result.
>>
>> I had the same kind of problem once and solved it by bunching hundreds of
>> files together into larger ones. I used text files with bzip2 compression.
>>
>>
>>
>> On 2015-10-20 08:42, Sean Owen wrote:
>>
>> coalesce without a shuffle? it shouldn't be an action. It just treats
>> many partitions as one.
>>
>> On Tue, Oct 20, 2015 at 1:00 PM, t3l  wrote:
>>
>>>
>>> I have a dataset consisting of tens of thousands of binary files (each between 500kb and
>>> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>>> cluster are also the workers for Spark. I open the files as an RDD using
>>> sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action that
>>> involves this RDD, Spark spawns an RDD with more than 30000 partitions.
>>> And
>>> this takes ages to process these partitions even if you simply run
>>> "count".
>>> Performing a "repartition" directly after loading does not help, because
>>> Spark seems to insist on materializing the RDD created by binaryFiles
>>> first.
>>>
>>> How I can get around this?
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>


Hivecontext going out-of-sync issue

2015-06-18 Thread Ranadip Chatterjee
Hi All.

I have a partitioned table in Hive. The use case is to drop one of the
partitions before inserting new data every time the Spark process runs. I
am using the Hivecontext to read and write (dynamic partitions) and also to
alter the table to drop the partition before insert. Everything runs fine
when run for the first time (the partition being inserted didn't exist
before). However, if the partition existed and was dropped by the alter
table command in the same process, then the insert fails with the error
"FileNotFoundException: File does not exist : /part_col=val1/part-0". When the program is rerun as-is, it
succeeds as now the partition does not exist any more when it starts up.
Spark 1.3.0 on CDH5.4.0.
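
For reference, a minimal sketch of the flow described above (table, column and
value are placeholders; the refreshTable call is just an extra thing worth
trying, assuming it behaves as hoped on this Spark build):

```
// Minimal sketch: table, column and value are placeholders; newData holds the new rows.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SET hive.exec.dynamic.partition = true")
hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

hc.sql("ALTER TABLE my_table DROP IF EXISTS PARTITION (part_col = 'val1')")
hc.refreshTable("my_table")   // an assumption: clearing cached metadata/file listings may help

newData.registerTempTable("new_data")
// Dynamic partition insert; part_col must be the last column of the SELECT.
hc.sql("INSERT INTO TABLE my_table PARTITION (part_col) SELECT * FROM new_data")
```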

Things I have tried:
- Put a pause of up to 1 min between alter table and insert to ensure that
any possibly pending async task in the background gets time to finish.
- Create a new HiveContext object and call the insert with it (i.e. call "drop
partition" and insert using separate hive context objects). The intention was
that a new hive context would perhaps pick up the correct state of the hive
metastore at that moment and work.
- Create a new SparkContext and a HiveContext (more of a shot in the dark):
create a new set of contexts after the alter table to try to reload the state
at that point in time.

None of these have worked so far.

Any ideas, suggestions or experiences on similar lines..?

-- 
Regards,
Ranadip Chatterjee