Not hard at all (about 5 minutes of work). Will create a change for it.
On Sat, Mar 5, 2016 at 1:31 AM, Mike Carey <[email protected]> wrote:

> It would be nice to have the parallelism of loading be
> dataset-property-determined rather than number-of-input-files determined
> (e.g., min(number of partitions, number of input files)) and then have the
> leaves of the load job each handle a delegated list of files. How hard
> would that be? :-)
>
> On 3/4/16 2:04 PM, Young-Seok Kim wrote:
>
>> That makes sense.
>>
>> Cheers,
>> Young-Seok
>>
>> On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <[email protected]> wrote:
>>
>>> Young-Seok,
>>>
>>> That works when the number of local files is relatively small.
>>> However, when the number of localfs files is 1000, the 1000 files will
>>> be loaded in parallel simultaneously, which will exhaust all system
>>> resources. Loading from HDFS doesn't have this problem because the
>>> 1000 (or more) file splits will be queued into each parallel loader.
>>>
>>> Best,
>>> Yingyi
>>>
>>> On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <[email protected]> wrote:
>>>
>>>> You can also load multiple adm files into the same dataset with a
>>>> single AQL statement as follows:
>>>>
>>>> load dataset Tweets
>>>> using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
>>>> (("path"=
>>>> "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
>>>> 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
>>>> 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
>>>> 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
>>>> 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
>>>> 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
>>>> 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
>>>> 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
>>>> ("format"="adm"));
>>>>
>>>> The above AQL loads 8 adm files into a single dataset named Tweets.
>>>>
>>>> Cheers,
>>>> Young-Seok
>>>>
>>>> On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <[email protected]> wrote:
>>>>
>>>>> Hi Yingyi,
>>>>>
>>>>> Thanks for your reply. I think the external dataset with a scan query
>>>>> is a good solution. I will try that. Thank you.
>>>>>
>>>>> Best,
>>>>> Xikui
>>>>>
>>>>> On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <[email protected]> wrote:
>>>>>
>>>>>> Xikui,
>>>>>>
>>>>>> If the number of localfs files is too large, a solution could be to
>>>>>> put your files on HDFS and then load them. Loading from HDFS always
>>>>>> has a fixed degree of parallelism regardless of the number of files.
>>>>>>
>>>>>>> I am wondering, is there a way to append adm files to an existing
>>>>>>> dataset?
>>>>>>
>>>>>> You can create an external dataset and then write an insert statement
>>>>>> where the body is a scan query. AsterixDB doesn't load any data into
>>>>>> its own storage for an external dataset but just keeps file paths.
>>>>>> Here is a manual for external datasets:
>>>>>> https://ci.apache.org/projects/asterixdb/aql/externaldata.html
>>>>>>
>>>>>> Best,
>>>>>> Yingyi
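For illustration, here is a minimal sketch of the external-dataset-plus-insert approach described above. The dataset name NewTweets, the type name TweetType, the node name my_nc1, and the file path are placeholders and not from this thread; Tweets is used as the target dataset as in Young-Seok's example.

create external dataset NewTweets(TweetType)
using localfs
(("path"="my_nc1:///data/new-tweets.adm"),
 ("format"="adm"));

insert into dataset Tweets (
  for $t in dataset NewTweets
  return $t
);

Since the external dataset only keeps the file path, it can be dropped after the insert completes without affecting the records copied into Tweets.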
>>>>>> On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I want to import data from multiple adm files into the same dataset.
>>>>>>> Merging them together and then loading from localfs can be a viable
>>>>>>> solution, but this may become a problem when the number of files
>>>>>>> becomes too large. I am wondering, is there a way to append an adm
>>>>>>> file to an existing dataset?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xikui
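For completeness, a sketch of the HDFS route Yingyi mentions, assuming the adm files have already been copied into an HDFS directory. The namenode address, port, and directory path are placeholders, and the exact adapter parameter names should be double-checked against the external-data manual linked above.

load dataset Tweets
using hdfs
(("hdfs"="hdfs://namenode:8020"),
 ("path"="/user/xikui/tweets"),
 ("input-format"="text-input-format"),
 ("format"="adm"));

With this form the degree of parallelism is fixed by the cluster rather than by the number of files, so a large number of small adm files will not exhaust system resources.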
