[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
[ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877303#action_12877303 ]

Ning Zhang commented on HIVE-1139:
----------------------------------

Arvind, I remember running into this (non-serializable) problem before, and it boils down to supporting ObjectInspector in the same way that RowContainer does. A complete Map interface would be good to have, but I would be very cautious about the performance penalty it may add. If the new API is not needed for now, I would vote for the ObjectInspector support first.

> GroupByOperator sometimes throws OutOfMemory error when there are too many
> distinct keys
> --------------------------------------------------------------------------
>
> Key: HIVE-1139
> URL: https://issues.apache.org/jira/browse/HIVE-1139
> Project: Hadoop Hive
> Issue Type: Bug
> Reporter: Ning Zhang
> Assignee: Arvind Prabhakar
>
> When a partial aggregation is performed on a mapper, a HashMap is created to
> keep all distinct keys in main memory. This can lead to an OOM exception when
> there are too many distinct keys for a particular mapper. A workaround is to
> set the map split size smaller so that each mapper processes fewer rows. A
> better solution is to use the persistent HashMapWrapper (currently used in
> CommonJoinOperator) to spill overflow rows to disk.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877295#action_12877295 ]

Prafulla Tekawade commented on HIVE-417:
----------------------------------------

Yes Ashish, that's what I had in mind. The rewrite system would need metadata, and hence it should be invoked after the semantic analysis phase, which makes the metadata available.

> Implement Indexing in Hive
> --------------------------
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
> Reporter: Prasad Chakka
> Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch,
> hive-indexing.3.patch
>
> Implement indexing in Hive so that lookup and range queries are efficient.
[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
[ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877239#action_12877239 ]

Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

Ashish - no problem - let me explain. The problem being addressed by this JIRA is that {{GroupByOperator}} (and possibly other aggregation operators) uses in-memory maps to store intermediate keys, which can lead to an {{OutOfMemoryException}} when the number of such keys is large. One suggested workaround is to use the {{HashMapWrapper}} class, which would alleviate the memory concern since it is capable of spilling the excess data to disk.

The {{HashMapWrapper}}, however, uses Java serialization to write out the excess data. This does not work when the data contains non-serializable objects such as {{Writable}} types - {{Text}}, etc. What I have done so far is to modify the {{HashMapWrapper}} to support the full {{java.util.Map}} interface. However, when I tried updating the {{GroupByOperator}} to use it, I ran into the said serialization problem. That's why I was suggesting that perhaps we should decouple the serialization problem from enhancing the {{HashMapWrapper}} and let the latter be checked in independently.
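The serialization failure Arvind describes can be reproduced in miniature: Java serialization rejects any object whose class does not implement java.io.Serializable, which is the situation with Writable types such as Text. The sketch below is a hypothetical illustration (the class and method names are not Hive's), showing what happens when a spill mechanism based on ObjectOutputStream meets a non-serializable key.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SpillDemo {
    // A key type that implements Serializable: spilling via Java
    // serialization works for this one.
    static class SerializableKey implements Serializable {
        final String value;
        SerializableKey(String value) { this.value = value; }
    }

    // A key type that does NOT implement Serializable -- analogous to
    // Writable types like Text, which Java serialization cannot handle.
    static class PlainKey {
        final String value;
        PlainKey(String value) { this.value = value; }
    }

    // Attempt to "spill" an object the way a Java-serialization-based
    // wrapper would; returns false if the object is not serializable.
    static boolean canSpill(Object key) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(key);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

This is why supporting ObjectInspector-based serialization (as RowContainer does) is attractive: it sidesteps Java serialization entirely.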
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877236#action_12877236 ]

Ashish Thusoo commented on HIVE-417:
------------------------------------

A couple of comments on this: A complication with doing a rewrite just after the parse is that you lose the ability to report back errors that correspond to the original query. Also, the metadata that you need to do the rewrite is only available after phase 1 of semantic analysis. So in my opinion the rewrite should be done after semantic analysis but before plan generation. Is that what you had in mind... so something like...

[Query parser]
[Query semantic analysis]
[Query optimization]
...
[jira] Created: (HIVE-1398) Support union all without an outer select *
Support union all without an outer select *
-------------------------------------------

Key: HIVE-1398
URL: https://issues.apache.org/jira/browse/HIVE-1398
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Ashish Thusoo
Assignee: Ashish Thusoo

In Hive, a union all query has to be wrapped in a subquery, as shown below:

select * from (select c1 from t1 union all select c2 from t2);

This JIRA proposes to fix that, so that the following is supported:

select c1 from t1 union all select c2 from t2;
[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
[ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877232#action_12877232 ]

Ashish Thusoo commented on HIVE-1139:
-------------------------------------

Arvind, I thought the whole point of this JIRA was to make HashMapWrapper support java.util.Map, no? If that would be a separate JIRA, what would this one be for? Sorry for being a bit dense here, but if you could clarify that would be great.

Thanks,
Ashish
[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column
[ https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877233#action_12877233 ]

Ashish Thusoo commented on HIVE-1397:
-------------------------------------

+1. This would be a cool contribution.

> histogram() UDAF for a numerical column
> ---------------------------------------
>
> Key: HIVE-1397
> URL: https://issues.apache.org/jira/browse/HIVE-1397
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Query Processor
> Affects Versions: 0.6.0
> Reporter: Mayank Lahiri
> Assignee: Mayank Lahiri
> Fix For: 0.6.0
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte,
> short, double, long, etc.) column. The result is returned as a map of (x,y)
> histogram pairs, and can be plotted in Gnuplot using impulses (for example).
> The algorithm is currently adapted from "A streaming parallel decision tree
> algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space
> proportional to the number of histogram bins specified. It has no
> approximation guarantees, but seems to work well when there is a lot of data
> and a large number (e.g. 50-100) of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map
> object.
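The space-bounded scheme the description refers to can be sketched roughly as follows. This is a simplified illustration of the Ben-Haim/Tom-Tov idea - keep at most N (centroid, count) bins, and on overflow merge the two adjacent bins with the closest centroids into their weighted average - not the patch's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingHistogram {
    // (centroid, count) pairs, kept sorted by centroid.
    private final List<double[]> bins = new ArrayList<>();
    private final int maxBins;

    public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    public void add(double x) {
        // Insert a singleton bin (x, 1) at its sorted position.
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < x) i++;
        bins.add(i, new double[]{x, 1});
        // If over budget, merge the two adjacent bins whose centroids
        // are closest, replacing them with their weighted average.
        if (bins.size() > maxBins) {
            int best = 0;
            for (int j = 1; j + 1 < bins.size(); j++) {
                if (bins.get(j + 1)[0] - bins.get(j)[0]
                        < bins.get(best + 1)[0] - bins.get(best)[0]) {
                    best = j;
                }
            }
            double[] a = bins.get(best), b = bins.get(best + 1);
            double count = a[1] + b[1];
            double centroid = (a[0] * a[1] + b[0] * b[1]) / count;
            bins.set(best, new double[]{centroid, count});
            bins.remove(best + 1);
        }
    }

    public int size() { return bins.size(); }

    public double totalCount() {
        double t = 0;
        for (double[] bin : bins) t += bin[1];
        return t;
    }
}
```

Merging adjacent bins into their weighted average keeps the bin list sorted and preserves the total count, which is why the structure uses space proportional only to the number of bins requested.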
[jira] Updated: (HIVE-1373) Missing connection pool plugin in Eclipse classpath
[ https://issues.apache.org/jira/browse/HIVE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-1373:
--------------------------------

Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Fix Version/s: 0.6.0
Resolution: Fixed

Committed. Thanks Vinithra!

> Missing connection pool plugin in Eclipse classpath
> ---------------------------------------------------
>
> Key: HIVE-1373
> URL: https://issues.apache.org/jira/browse/HIVE-1373
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Build Infrastructure
> Environment: Eclipse, Linux
> Reporter: Vinithra Varadharajan
> Assignee: Vinithra Varadharajan
> Fix For: 0.6.0
>
> Attachments: HIVE-1373.patch
>
> In a recent checkin, a connection pool dependency was introduced but the
> Eclipse .classpath file was not updated. This causes launch configurations
> from within Eclipse to fail.
> {code}
> hive> show tables;
> show tables;
> 10/05/26 14:59:46 INFO parse.ParseDriver: Parsing command: show tables
> 10/05/26 14:59:46 INFO parse.ParseDriver: Parse Completed
> 10/05/26 14:59:46 INFO ql.Driver: Semantic Analysis Completed
> 10/05/26 14:59:46 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
> 10/05/26 14:59:46 INFO ql.Driver: query plan = file:/tmp/vinithra/hive_2010-05-26_14-59-46_058_1636674338194744357/queryplan.xml
> 10/05/26 14:59:46 INFO ql.Driver: Starting command: show tables
> 10/05/26 14:59:46 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 10/05/26 14:59:46 INFO metastore.ObjectStore: ObjectStore, initialize called
> FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
> 10/05/26 14:59:47 ERROR exec.DDLTask: FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
> org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
>     at org.apache.hadoop.hive.ql.metadata.Hive.getTablesForDb(Hive.java:491)
>     at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:472)
>     at org.apache.hadoop.hive.ql.metadata.Hive.getAllTables(Hive.java:458)
>     at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:504)
>     at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:176)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107)
>     at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:631)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:504)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:382)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:303)
> Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
>     at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:395)
>     at org.datanucleus.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:547)
>     at org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:175)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951)
>     at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159)
>     at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
>     at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
>     at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:191)
>     at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(Ob
[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
[ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877222#action_12877222 ]

Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

If there is interest, I can file a separate JIRA for modifying {{HashMapWrapper}} to support the {{java.util.Map}} interface and decouple that work from this JIRA. I think there is a lot of benefit in doing just that. Also, we could have this JIRA depend upon that one as a prerequisite.
[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column
[ https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877220#action_12877220 ]

Edward Capriolo commented on HIVE-1397:
---------------------------------------

Looks great. Cannot wait.
[jira] Commented: (HIVE-1386) HiveQL SQL Compliance (Umbrella)
[ https://issues.apache.org/jira/browse/HIVE-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877204#action_12877204 ]

Edward Capriolo commented on HIVE-1386:
---------------------------------------

As a side note, maybe with something like a total order partitioner we can do a true map/reduce ORDER BY.

> HiveQL SQL Compliance (Umbrella)
> --------------------------------
>
> Key: HIVE-1386
> URL: https://issues.apache.org/jira/browse/HIVE-1386
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Carl Steinbach
>
> This is an umbrella ticket to track work related to HiveQL compliance with
> the SQL standard, e.g. supported query syntax, data types, views, catalog
> access, etc.
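The total-order-partitioner idea mentioned in the comment is to route each key to a reducer by comparing it against sampled cut points, so that concatenating the reducer outputs in order yields a globally sorted result. A minimal sketch of that routing step, assuming the cut points have already been obtained by sampling (the class is hypothetical, not Hadoop's TotalOrderPartitioner):

```java
import java.util.Arrays;

public class RangePartitioner {
    // Sorted boundary keys; with k cut points there are k + 1 reducers.
    private final int[] cutPoints;

    public RangePartitioner(int[] cutPoints) {
        this.cutPoints = cutPoints.clone();
        Arrays.sort(this.cutPoints);
    }

    // Reducer index for a key: the number of cut points strictly below
    // it. Keys in lower ranges go to lower-numbered reducers, so the
    // concatenation of sorted reducer outputs is globally sorted.
    public int partition(int key) {
        int idx = Arrays.binarySearch(cutPoints, key);
        return idx >= 0 ? idx : -idx - 1;
    }
}
```

With cut points {10, 20}, keys below 10 land on reducer 0, keys in (10, 20] on reducer 1, and larger keys on reducer 2; each reducer then sorts its own range locally as usual.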
Re: Cannot access more than one hive prompt
On Wed, Jun 9, 2010 at 3:42 PM, jaydeep vishwakarma <jaydeep.vishwaka...@mkhoj.com> wrote:
> Hi,
>
> I am trying to access two hive prompt from same machine. Only first one
> is working. But other one hive prompt showing following error when doing
> simple select query.
>
> FAILED: Error in semantic analysis: Unable to fetch table employee
>
> How to access more than one hive prompt in same system.
>
> Regards,
> Jaydeep
>
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify us
> immediately by responding to this email and then delete it from your system.
> The firm is neither liable for the proper and complete transmission of the
> information contained in this communication nor for any delay in its
> receipt.

I remember once asking the same question :)
http://wiki.apache.org/hadoop/HiveDerbyServerMode
[jira] Commented: (HIVE-1386) HiveQL SQL Compliance (Umbrella)
[ https://issues.apache.org/jira/browse/HIVE-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877183#action_12877183 ]

Jeff Hammerbacher commented on HIVE-1386:
-----------------------------------------

The discussion in HIVE-61 seems to indicate that SORT BY was decided upon as the syntax, and that to simulate ORDER BY behavior one should set the number of reduce tasks to 1. I don't have an instance of Hive running nearby, but if there is now ORDER BY in the syntax, could you please update the language guide?
Cannot access more than one hive prompt
Hi,

I am trying to access two Hive prompts from the same machine. Only the first one is working; the other prompt shows the following error when running a simple select query:

FAILED: Error in semantic analysis: Unable to fetch table employee

How can I access more than one Hive prompt on the same system?

Regards,
Jaydeep
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877144#action_12877144 ]

He Yongqiang commented on HIVE-417:
-----------------------------------

Plan sounds perfectly good to me!
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877049#action_12877049 ]

Prafulla Tekawade commented on HIVE-417:
----------------------------------------

I was thinking of adding something called a query rewrite module. It would be a rule-based query rewrite system that rewrites a query into a semantically equivalent query which is more optimized and/or uses indexes (not just for scans, but for other query operators, e.g. GROUP BY etc.).

E.g.:

select distinct c1 from t1;

If we have a dense index ('compact summary index' in this Hive indexing patch) on c1, this query can be replaced with a query on the index table itself:

select idx_key from t1_cmpct_sum_idx;

Similar query transformations can happen for other queries. The module will be placed just before the optimizer and will help the optimizer. The module structure looks like below:

[Query parser]
[Query rewrites] --> new phase
[Query optimization]
[Query execution planner]
[Query execution engine]

The rewrite module is 'generic', not just for the above indexing case, but for other cases too, e.g. OR predicates to union (for efficiency?), outer join to a union of anti and semi joins, moving 'order by' out of a union subquery, etc. The aim is to implement a very simple, light-weight rewrite support, implement the indexing-related rewrites (the above rewrite does not even need a new run-time map-reduce operator), and integrate indexing support quickly and cleanly. As noted above, this rewrite phase is rule-based (not cost-based) - a sort of early optimization.

Let me know what you think. I'll start by reading your patch. This would do most of TODO 1; TODO 2 and 3 will have to be looked into.
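The distinct-to-index rewrite described in the comment could, in toy form, look like the sketch below. This is purely illustrative: a real rule would match on the semantic analyzer's tree and consult the metastore for index availability, not pattern-match raw query text, and the index table naming (`<table>_cmpct_sum_idx`, key column `idx_key`) is taken from the comment's example, not from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistinctToIndexRewrite {
    // Matches the single query shape this toy rule handles:
    // "select distinct <col> from <table>".
    private static final Pattern DISTINCT = Pattern.compile(
        "select distinct (\\w+) from (\\w+)", Pattern.CASE_INSENSITIVE);

    // Returns the rewritten query if the rule applies, else the input
    // unchanged -- rule-based rewrites must be semantics-preserving.
    static String rewrite(String query) {
        Matcher m = DISTINCT.matcher(query.trim());
        if (!m.matches()) return query;
        String col = m.group(1);
        String table = m.group(2);
        // A real rule would verify here, via the metastore, that a
        // compact summary index actually exists on 'col'.
        return "select idx_key from " + table + "_cmpct_sum_idx";
    }
}
```

Because the rule either fires cleanly or leaves the query untouched, a pipeline of such rules can run before the optimizer without risking correctness, which is the "simple, light-weight" property the comment is after.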
Re: Question about Hadoop task side-effect files//
Hi,

Using frameworks like Pig and Hive already avoids this (unless you write your own stores/writers). What these do is have each mapper or reducer (depending on where you write your final data from) write to its own unique file on HDFS. Have a look at the contents of a table in Hive, which is normally a folder on HDFS with multiple files. Inserting into a Hive table just writes another file to the folder.

----- Original Message -----
From: wuxy
To: hive-dev@hadoop.apache.org
Sent: Wed Jun 09 07:08:22 2010
Subject: Question about Hadoop task side-effect files//

I found the following section at the end of chapter 6 of the book, 'Task side-effect files':

"Care needs to be taken to ensure that multiple instances of the same task don't try to write to the same file. There are two problems to avoid: if a task failed and was retried, then the old partial output would still be present when the second task ran, and it would have to delete the old file first. Second, with speculative execution enabled, two instances of the same task could try to write to the same file simultaneously."

In the description, "two instances of the same task could try to write to the same file simultaneously" is a case that should be avoided. Can anyone confirm this for me, and if possible, explain the reason behind it? Thanks.

Steven. Wu
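The one-unique-file-per-task scheme described in the reply can be sketched as below. The naming convention is illustrative only (Hive's actual file names are chosen by the execution framework): each task attempt writes to a path derived from its task id and attempt number, so retried and speculative duplicates of a task never collide, and only the committed attempt is renamed to the final name.

```java
public class TaskOutputNaming {
    // Temporary path for a specific task attempt: distinct per attempt,
    // so a retry or a speculative duplicate never clobbers another
    // attempt's partial output.
    static String outputFileFor(String tableDir, int taskId, int attempt) {
        return String.format("%s/_tmp/part-%05d_attempt-%d",
                             tableDir, taskId, attempt);
    }

    // Final path for the task's output: only the attempt that wins the
    // commit gets its temporary file renamed to this name.
    static String committedFileFor(String tableDir, int taskId) {
        return String.format("%s/part-%05d", tableDir, taskId);
    }
}
```

This is why a Hive table directory typically contains files like part-00000, part-00001, ...: one per task, with failed and speculative attempts cleaned up from the temporary area.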