I had this issue before... I tried to load 23K files (I know it's ridiculous) and the job failed, but after coalescing them into 1,000 files the load worked just fine. It would also be nice to repartition when loading one large file, to speed up the parsing.
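In case it helps anyone hitting the same wall, below is the kind of coalescing I mean - a minimal sketch in plain Python, nothing AsterixDB-specific. The directory names, the *.adm glob pattern, and the 1,000-file target are all just illustrative assumptions:

    import glob
    import os

    SRC_DIR = "raw"        # hypothetical directory holding the ~23K small files
    DST_DIR = "coalesced"  # output directory for the merged files
    N_BUCKETS = 1000       # target number of files after coalescing

    os.makedirs(DST_DIR, exist_ok=True)
    files = sorted(glob.glob(os.path.join(SRC_DIR, "*.adm")))

    # Round-robin the small files into N_BUCKETS larger ones, so the load
    # reads ~1K reasonably sized files instead of tens of thousands of tiny ones.
    for i, path in enumerate(files):
        bucket = os.path.join(DST_DIR, "part-%04d.adm" % (i % N_BUCKETS))
        with open(bucket, "a") as out, open(path) as src:
            data = src.read()
            out.write(data)
            if data and not data.endswith("\n"):
                out.write("\n")  # keep records from adjacent files on separate lines

Splitting one large file into N chunks on record boundaries would do the reverse for the single-big-file case.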
I remember a similar issue appearing in Spark: https://github.com/mesos/spark/pull/718

On Sun, Feb 21, 2016 at 8:52 AM, Till Westmann <[email protected]> wrote:

> Sounds like a good candidate for a JIRA issue, so we won't forget. :)
>
> Cheers,
> Till
>
> On Feb 20, 2016, at 21:44, abdullah alamoudi <[email protected]> wrote:
>
>> Totally agree. Probably better to make sure it works nicely with that
>> many tasks and then fix the number of readers.
>>
>> Cheers,
>> Abdullah.
>>
>> On Sun, Feb 21, 2016 at 2:04 AM, Mike Carey <[email protected]> wrote:
>>
>>> Sounds like the load job parallelism needs a redo - it probably
>>> shouldn't be more than the number of target partitions, IMO...?
>>>
>>> On Feb 20, 2016 12:41 PM, "abdullah alamoudi" <[email protected]> wrote:
>>>
>>>> I have an idea that might explain why such strange behavior happened.
>>>> I believe it could be due to the number of task partitions being very
>>>> high, assuming each of the 76 files is being read in a separate task.
>>>> This could lead to corner cases that we didn't consider before: since
>>>> the number of threads in the task thread pool is less than 76, some
>>>> tasks will not be able to start until others have completed execution.
>>>>
>>>> Just a thought,
>>>> Abdullah.
>>>>
>>>> On Fri, Feb 19, 2016 at 9:43 PM, abdullah alamoudi <[email protected]> wrote:
>>>>
>>>>> Yiran,
>>>>> Here is one problem causing a failure:
>>>>>
>>>>> edu.uci.ics.hyracks.api.exceptions.HyracksDataException:
>>>>> edu.uci.ics.hyracks.api.exceptions.HyracksDataException:
>>>>> edu.uci.ics.hyracks.storage.am.common.exceptions.TreeIndexDuplicateKeyException:
>>>>> Input stream given to BTree bulk load has duplicates.
>>>>>
>>>>> which tells us that the input stream given to the BTree bulk load has
>>>>> duplicates. The question is why this was not returned as the error
>>>>> message; we need to look into that.
>>>>>
>>>>> I will continue looking at the log file to see if there were other
>>>>> issues.
>>>>>
>>>>> Can you share with us the load statement you're using? I would like
>>>>> to see how you're loading all the files; we might be able to suggest
>>>>> a way to make it work better.
>>>>>
>>>>> Cheers,
>>>>> Abdullah.
>>>>>
>>>>> On Fri, Feb 19, 2016 at 9:31 PM, Yiran Wang <[email protected]> wrote:
>>>>>
>>>>>> Abdullah,
>>>>>>
>>>>>> Here is the log, attached. Thank you all very much for looking into
>>>>>> this.
>>>>>>
>>>>>> Ian - I have two query questions besides this loading issue. I was
>>>>>> wondering if I could meet briefly with you (or over email) regarding
>>>>>> that.
>>>>>>
>>>>>> Thanks!
>>>>>> Yiran
>>>>>>
>>>>>> On Fri, Feb 19, 2016 at 9:38 AM, Mike Carey <[email protected]> wrote:
>>>>>>
>>>>>>> Maybe Ian can visit the cluster with Yiran later today?
>>>>>>>
>>>>>>> On Feb 19, 2016 1:31 AM, "abdullah alamoudi" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yiran,
>>>>>>>> Can you share the logs? It would help us identify the actual cause
>>>>>>>> of this failure much faster.
>>>>>>>>
>>>>>>>> I am pretty sure you know this, but in case you don't, you can get
>>>>>>>> the logs using:
>>>>>>>>
>>>>>>>>> managix log -n <instance-name>
>>>>>>>>
>>>>>>>> Also, it would be nice if someone from the team had access to the
>>>>>>>> cluster so we could work with it directly.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Abdullah.
>>>>>>>>
>>>>>>>> On Fri, Feb 19, 2016 at 9:40 AM, Yiran Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Steven,
>>>>>>>>>
>>>>>>>>> Thanks for getting back to me so quickly! I wasn't clear. Here is
>>>>>>>>> what happened:
>>>>>>>>>
>>>>>>>>> I test-loaded the first 32 files, no problem. I deleted the
>>>>>>>>> dataset, created a new one, and tried to load the entire 76 files
>>>>>>>>> into the newly created (hence empty) dataset.
>>>>>>>>>
>>>>>>>>> It took about 2 minutes after executing the query for the error
>>>>>>>>> message to show up. There are currently 31,710,406 rows of data
>>>>>>>>> in the dataset, despite the error message (so it looks like it
>>>>>>>>> did load).
>>>>>>>>>
>>>>>>>>> So my questions are: 1) why did I still get that error message
>>>>>>>>> when I was loading into an empty dataset, and 2) I'm not sure
>>>>>>>>> whether all the data from the 76 files is fully loaded. Are there
>>>>>>>>> other ways to check, besides trying to load it again and hoping
>>>>>>>>> this time I don't get the error?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Yiran
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2016 at 10:29 PM, Steven Jacobs <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Welcome! We are an Apache incubator project now, so I added the
>>>>>>>>>> correct mailing list. Our "load" statement only works on an
>>>>>>>>>> empty dataset; subsequent data needs to be added with an insert
>>>>>>>>>> or a feed. You should be able to load all 76 files at once,
>>>>>>>>>> though (starting from empty).
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> On Thursday, February 18, 2016, Yiran Wang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Asterix team!
>>>>>>>>>>>
>>>>>>>>>>> I've come across this error while trying to load 76 files into
>>>>>>>>>>> a dataset. When I test-loaded the first 32 files, there was no
>>>>>>>>>>> such error. All 76 files are of the same data format.
>>>>>>>>>>>
>>>>>>>>>>> Can you help interpret what this error message means?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Yiran
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best,
>>>>>>>>>>> Yiran
--
Regards,
Wail Alkowaileet
