Not hard at all (about 5 minutes of work). Will create a change for it.
On Sat, Mar 5, 2016 at 1:31 AM, Mike Carey <[email protected]> wrote:

> It would be nice to have the parallelism of loading be
> dataset-property-determined rather than number-of-input-files determined
> (e.g., min(number of partitions, number of input files)) and then have the
> leaves of the load job each handle a delegated list of files. How hard
> would that be? :-)
>
> On 3/4/16 2:04 PM, Young-Seok Kim wrote:
>
>> That makes sense.
>>
>> Cheers,
>> Young-Seok
>>
>> On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <[email protected]> wrote:
>>
>>> Young-Seok,
>>>
>>> That works when the number of local files is relatively small.
>>> However, when the number of localfs files is 1000, the 1000 files will
>>> be loaded in parallel simultaneously, which will exhaust all system
>>> resources. Loading from HDFS doesn't have this problem because the
>>> 1000 (or more) file splits will be queued into each parallel loader.
>>>
>>> Best,
>>> Yingyi
>>>
>>> On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <[email protected]> wrote:
>>>
>>>> You can also load multiple adm files into the same dataset with a
>>>> single AQL statement as follows:
>>>>
>>>> load dataset Tweets
>>>> using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
>>>> (("path"=
>>>> "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
>>>> 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
>>>> 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
>>>> 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
>>>> 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
>>>> 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
>>>> 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
>>>> 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
>>>> ("format"="adm"));
>>>>
>>>> The above AQL loads 8 adm files into a single dataset named Tweets.
>>>>
>>>> Cheers,
>>>> Young-Seok
>>>>
>>>> On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <[email protected]> wrote:
>>>>
>>>>> Hi Yingyi,
>>>>>
>>>>> Thanks for your reply. I think the external dataset with a scan query
>>>>> is a good solution. I will try that. Thank you.
>>>>>
>>>>> Best,
>>>>> Xikui
>>>>>
>>>>> On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <[email protected]> wrote:
>>>>>
>>>>>> Xikui,
>>>>>>
>>>>>> If the number of localfs files is too large, a solution could be to
>>>>>> put your files on HDFS and then load them. Loading from HDFS always
>>>>>> has a fixed degree of parallelism regardless of the number of files.
>>>>>>
>>>>>>> I am wondering, is there a way to append adm files to an existing
>>>>>>> dataset?
>>>>>>
>>>>>> You can create an external dataset and then write an insert statement
>>>>>> where the body is a scan query. AsterixDB doesn't load any data into
>>>>>> its own storage for an external dataset but just keeps file paths.
>>>>>> Here is a manual for external datasets:
>>>>>> https://ci.apache.org/projects/asterixdb/aql/externaldata.html
>>>>>>
>>>>>> Best,
>>>>>> Yingyi
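For illustration, here is a minimal sketch of the external-dataset-plus-insert approach described above. The dataset name NewTweets, the type name TweetType, the node name my_nc1, and the file path are placeholders and not from this thread; Tweets is used as the target dataset as in Young-Seok's example.

create external dataset NewTweets(TweetType)
using localfs
(("path"="my_nc1:///data/new-tweets.adm"),
 ("format"="adm"));

insert into dataset Tweets (
  for $t in dataset NewTweets
  return $t
);

Since the external dataset only keeps the file path, it can be dropped after the insert completes without affecting the records copied into Tweets.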
>>>>>> On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I want to import data from multiple adm files into the same dataset.
>>>>>>> Merging them together and then loading from localfs can be a viable
>>>>>>> solution, but this may become a problem when the number of files
>>>>>>> becomes too large. I am wondering, is there a way to append an adm
>>>>>>> file to an existing dataset?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xikui
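For completeness, a sketch of the HDFS route Yingyi mentions, assuming the adm files have already been copied into an HDFS directory. The namenode address, port, and directory path are placeholders, and the exact adapter parameter names should be double-checked against the external-data manual linked above.

load dataset Tweets
using hdfs
(("hdfs"="hdfs://namenode:8020"),
 ("path"="/user/xikui/tweets"),
 ("input-format"="text-input-format"),
 ("format"="adm"));

With this form the degree of parallelism is fixed by the cluster rather than by the number of files, so a large number of small adm files will not exhaust system resources.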
