Re: Do we have a method to append local files to existed dataset?

Mike Carey Sat, 05 Mar 2016 16:24:06 -0800

:-)  Thx!!

On 3/5/16 2:12 AM, abdullah alamoudi wrote:

Not hard at all. (about 5 minutes of work).


Will create a change for it.

On Sat, Mar 5, 2016 at 1:31 AM, Mike Carey <[email protected]> wrote:

It would be nice to have the parallelism of loading be
dataset-property-determined rather than number-of-input-files determined
(e.g., min(number of partitions, number of input files)) and then have the
leaves of the load job each handle a delegated list of files.  How hard
would that be?  :-)
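
A rough sketch of the delegation idea above, written in Python rather than AsterixDB's Java. The names (delegate_files, num_partitions) are hypothetical, and round-robin is only one possible assignment policy, not necessarily what the eventual change implements:

```python
def delegate_files(files, num_partitions):
    """Split `files` across min(num_partitions, len(files)) loaders.

    Hypothetical helper: each returned list is the set of files that one
    leaf of the load job would read, one file after another.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    # Degree of parallelism is bounded by the dataset's partition count,
    # not by the number of input files.
    num_loaders = min(num_partitions, len(files))

    # One (initially empty) file list per loader; no files means no loaders.
    assignments = [[] for _ in range(num_loaders)]

    # Round-robin assignment keeps the lists within one file of each other.
    for i, path in enumerate(files):
        assignments[i % num_loaders].append(path)
    return assignments
```

With 1000 input files and, say, 8 partitions, this would give 8 loaders of 125 files each instead of 1000 simultaneous loaders.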

On 3/4/16 2:04 PM, Young-Seok Kim wrote:

That makes sense.

Cheers,
Young-Seok

On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <[email protected]> wrote:

Young-Seok,

That works when the number of local files is relatively small.
However, when the number of localfs files is 1000, all 1000 files will be
loaded in parallel at once, which will exhaust all system resources.
Loading from HDFS doesn't have this problem because the 1000 (or more)
file splits will be queued into the parallel loaders.

Best,
Yingyi


On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <[email protected]> wrote:

You can also load multiple adm files into the same dataset with a single
AQL statement, as follows:

load dataset Tweets
using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
(("path"=
"130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
("format"="adm"));


The above AQL loads 8 adm files into a single dataset named Tweets.
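
For a longer file list, a statement of this shape could be generated rather than typed by hand. The following is a minimal Python sketch under stated assumptions: build_load_statement is a hypothetical helper, the adapter class name is copied from the statement above, and no escaping or validation of hosts and paths is attempted:

```python
def build_load_statement(dataset, files, fmt="adm"):
    """Build an AQL load statement for a list of (host, absolute_path) pairs.

    Hypothetical helper: it only concatenates strings in the
    "host://path,host://path" form used in the statement above.
    """
    if not files:
        raise ValueError("at least one (host, path) pair is required")

    # An absolute path such as /data/x.adm yields host:///data/x.adm.
    path_value = ",".join(f"{host}://{path}" for host, path in files)

    return (
        f"load dataset {dataset}\n"
        'using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"\n'
        f'(("path"="{path_value}"),\n'
        f'("format"="{fmt}"));'
    )
```

Calling it with pairs such as ("127.0.0.1", "/tmp/a.adm") (made-up values) returns the statement text, which would then be submitted to AsterixDB in the usual way.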


Cheers,

Young-Seok

On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <[email protected]> wrote:

Hi Yingyi,

Thanks for your reply. I think the external dataset with a scan query is
a good solution.

I will try that. Thank you.

Best,
Xikui

On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <[email protected]> wrote:

Xikui,

If the number of localfs files is too large, a solution could be to put
your files on HDFS and then load them. Loading from HDFS always has a
fixed degree of parallelism regardless of the number of files.

> I am wondering whether there is a way to append an adm file to an
> existing dataset?

You can create an external dataset and then write an insert statement
where the body is a scan query. AsterixDB doesn't load any data into its
own storage for an external dataset but just keeps file paths.

Here is a manual for external datasets:
https://ci.apache.org/projects/asterixdb/aql/externaldata.html

Best,
Yingyi


On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <[email protected]> wrote:

Hi,

I want to import data from multiple adm files into the same dataset.
Merging them together and then loading from localfs can be a viable
solution, but this may become a problem when the number of files becomes
too large. I am wondering whether there is a way to append an adm file to
an existing dataset?

Thank you.

Best,
Xikui
