Sounds good :-) I think we can simply use the same HDFS adapter for localfs files. The HDFS API works for local files as well; the only thing that needs to be done is to switch to a different URL prefix for local files. That way, we don't need to worry about how to split a super-large local file :-)

Best,
Yingyi

On Fri, Mar 4, 2016 at 2:31 PM, Mike Carey <[email protected]> wrote:

> It would be nice to have the parallelism of loading be
> dataset-property-determined rather than number-of-input-files determined
> (e.g., min(number of partitions, number of input files)) and then have the
> leaves of the load job each handle a delegated list of files. How hard
> would that be? :-)
>
> On 3/4/16 2:04 PM, Young-Seok Kim wrote:
>
>> That makes sense.
>>
>> Cheers,
>> Young-Seok
>>
>> On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <[email protected]> wrote:
>>
>>> Young-Seok,
>>>
>>> That works when the number of local files is relatively small.
>>> However, when the number of localfs files is 1000, the 1000 files will be
>>> loaded in parallel simultaneously, which will exhaust all system resources.
>>> Loading from HDFS doesn't have the problem because the 1000 (or more) file
>>> splits will be queued into each parallel loader.
>>>
>>> Best,
>>> Yingyi
>>>
>>> On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <[email protected]> wrote:
>>>
>>>> You can also load multiple adm files into a same dataset with a single AQL
>>>> as follows:
>>>>
>>>> load dataset Tweets
>>>> using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
>>>> (("path"=
>>>> "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
>>>> 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
>>>> 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
>>>> 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
>>>> 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
>>>> 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
>>>> 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
>>>> 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
>>>> ("format"="adm"));
>>>>
>>>> The above AQL loads 8 adm files into a single dataset named Tweets.
>>>>
>>>> Cheers,
>>>> Young-Seok
>>>>
>>>> On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <[email protected]> wrote:
>>>>
>>>>> Hi Yingyi,
>>>>>
>>>>> Thanks for your reply. I think the external dataset with scan query is a
>>>>> good solution. I will try that. Thank you.
>>>>>
>>>>> Best,
>>>>> Xikui
>>>>>
>>>>> On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <[email protected]> wrote:
>>>>>
>>>>>> Xikui,
>>>>>>
>>>>>> If the number of localfs files is too large, a solution could be to put
>>>>>> your files on HDFS and then load it. Loading from HDFS always has a fixed
>>>>>> degree of parallelism regardless of the number of files.
>>>>>>
>>>>>>> I am wondering is there a way to append adm file to existed dataset?
>>>>>>
>>>>>> You can create an external dataset and then write an insert statement where
>>>>>> the body is a scan query. AsterixDB doesn't load any data into its own
>>>>>> storage for an external dataset but just keeps file paths.
>>>>>> Here is a manual for external datasets:
>>>>>> https://ci.apache.org/projects/asterixdb/aql/externaldata.html
>>>>>>
>>>>>> Best,
>>>>>> Yingyi
>>>>>>
>>>>>> On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I want to import data from multiple adm files into a same dataset. Merging
>>>>>>> them together and then loading from localfs can be a viable solution, but
>>>>>>> this may become a problem when the number become too large. I am wondering
>>>>>>> is there a way to append adm file to existed dataset?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xikui
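
For reference, the idea raised in this thread (a load parallelism of min(number of partitions, number of input files), with each leaf of the load job working through a delegated list of files) could look roughly like the Python sketch below. This is not AsterixDB code: the function name assign_files, the round-robin policy, and the file names are assumptions made purely for illustration.

```python
from typing import List


def assign_files(files: List[str], num_partitions: int) -> List[List[str]]:
    """Split `files` into min(num_partitions, len(files)) delegated lists.

    Hypothetical helper, not an AsterixDB API. Round-robin is an assumed
    policy; the thread does not say how files would be delegated.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not files:
        return []
    degree = min(num_partitions, len(files))
    # Loader i takes files i, i + degree, i + 2 * degree, ...
    return [files[i::degree] for i in range(degree)]


if __name__ == "__main__":
    # 1000 hypothetical local files and 8 partitions give 8 loaders.
    files = [f"file-{n}.adm" for n in range(1000)]
    for loader, batch in enumerate(assign_files(files, 8)):
        print(loader, len(batch))
```

With 1000 files and 8 partitions, each of the 8 loaders gets a list of 125 files to work through in turn, instead of 1000 loads starting at once.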
