That works when the number of local files is relatively small.
However, when there are 1000 localfs files, all 1000 will be loaded in
parallel at once, which can exhaust system resources.
Loading from HDFS doesn't have this problem because the 1000 (or more)
file splits are queued into the parallel loaders.
Best,
Yingyi
On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <[email protected]> wrote:
You can also load multiple adm files into the same dataset with a single
AQL statement, as follows:
load dataset Tweets
using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
(("path"=
"130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
("format"="adm"));
The above AQL loads 8 adm files into a single dataset named Tweets.
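With more than a handful of files, writing the path list by hand gets error-prone. Here is a minimal Python sketch, assuming the same adapter and the same host:///path entry format as the statement above, that builds the statement from a list of (host, path) pairs. The helper name build_load_statement and the example paths are hypothetical, and the helper only produces the AQL text; it does not submit anything to AsterixDB.

```python
ADAPTER = "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"


def build_load_statement(dataset, files, fmt="adm"):
    """Return an AQL load statement for (host, absolute_path) pairs.

    Each entry is rendered as host://path and the entries are joined
    with commas, mirroring the hand-written statement above.
    """
    if not files:
        raise ValueError("at least one file is required")
    entries = []
    for host, path in files:
        if not path.startswith("/"):
            raise ValueError("path must be absolute: %r" % path)
        entries.append("%s://%s" % (host, path))
    return 'load dataset %s using "%s" (("path"="%s"),("format"="%s"));' % (
        dataset, ADAPTER, ",".join(entries), fmt)


# Hypothetical paths, for illustration only.
statement = build_load_statement(
    "Tweets",
    [("130.149.249.60", "/data/example/part0.adm"),
     ("130.149.249.53", "/data/example/part1.adm")],
)
print(statement)
```

This only helps with writing the statement. It does not change how the files are loaded, so it still suits a relatively small number of local files.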
Cheers,
Young-Seok
On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <[email protected]> wrote:
Hi Yingyi,
Thanks for your reply. I think the external dataset with a scan query
is a good solution. I will try that. Thank you.
Best,
Xikui
On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <[email protected]> wrote:
Xikui,
If the number of localfs files is too large, a solution could be to put
your files on HDFS and then load them. Loading from HDFS always has a
fixed degree of parallelism regardless of the number of files.
> I am wondering whether there is a way to append an adm file to an
> existing dataset?
You can create an external dataset and then write an insert statement
whose body is a scan query over it. AsterixDB doesn't load any data into
its own storage for an external dataset but just keeps the file paths.
Here is a manual for external datasets:
https://ci.apache.org/projects/asterixdb/aql/externaldata.html
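A rough sketch of what the two statements could look like, written as a small Python helper that only emits AQL text. The "create external dataset ... using localfs" form is assumed from the manual linked above and should be checked against your AsterixDB version. The names TweetsExternal and TweetType, the host, and the path are hypothetical.

```python
def build_append_statements(target, external, type_name, host, path):
    """Return AQL text that appends one local adm file to `target`.

    Assumed flow: declare the file as an external dataset, then insert
    the result of a scan over it into the target dataset.
    """
    create = (
        'create external dataset %s(%s) using localfs '
        '(("path"="%s://%s"),("format"="adm"));'
        % (external, type_name, host, path)
    )
    insert = (
        'insert into dataset %s (for $t in dataset %s return $t);'
        % (target, external)
    )
    return [create, insert]


# Hypothetical names and path, for illustration only.
statements = build_append_statements(
    "Tweets", "TweetsExternal", "TweetType",
    "127.0.0.1", "/data/example/part8.adm",
)
for s in statements:
    print(s)
```

The external dataset only records where the file lives; the insert statement is what copies the records into the target dataset.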
Best,
Yingyi
On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <[email protected]> wrote:
Hi,
I want to import data from multiple adm files into the same dataset.
Merging them together and then loading from localfs can be a viable
solution, but this may become a problem when the number of files becomes
too large. I am wondering whether there is a way to append an adm file
to an existing dataset?
Thank you.
Best,
Xikui