Re: Do we have a method to append local files to existed dataset?

abdullah alamoudi Sat, 05 Mar 2016 02:13:13 -0800

You shouldn't get that error even if you're using the localfs. I will
double check that.
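
For reference, the directory form suggested earlier in the thread would look
roughly like the statement below. This is only a sketch: the host address and
directory are placeholders, the `localfs` alias is assumed to refer to the same
adapter as in Young-Seok's example, and whether a directory is accepted here is
the open question in this reply.

```
// Hypothetical host and directory; every file under the directory would be
// read through the single "path" entry rather than listed one by one.
load dataset Tweets
using localfs
(("path"="127.0.0.1:///hypothetical/dir/of/adm/files"),
 ("format"="adm"));
```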


On Sat, Mar 5, 2016 at 2:41 AM, Xikui Wang <[email protected]> wrote:

> Hi,
>
> @Young-Seok, Thanks for noticing. This is quite convenient for loading
> small batch files.
>
> @Yingyi, Thanks for pointing out the limitations. I tried with my datasets
> (700 x 50MB per file), and it drained all system resources, as you expected.
> Actually, the mechanism that you mentioned, an HDFS-like localfs, is what I
> am looking for. That would be useful for standalone users. Or maybe we just
> don't care about standalone users since they are too small. :)
>
> @abdullah, I tried a directory path, but it doesn't go through. It raises an
> 'xxx is a directory' error. I guess it's because I am using localfs?
>
> Best,
> Xikui
>
> On Fri, Mar 4, 2016 at 2:28 PM, abdullah alamoudi <[email protected]>
> wrote:
>
> > You can however specify the directory in the path parameter and not the
> > individual files and they will be processed sequentially (or 1 thread per
> > specified path).
> >
> > On Sat, Mar 5, 2016 at 1:04 AM, Young-Seok Kim <[email protected]> wrote:
> >
> > > That makes sense.
> > >
> > > Cheers,
> > > Young-Seok
> > >
> > > On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <[email protected]> wrote:
> > >
> > > > Young-Seok,
> > > >
> > > > That works when the number of local files is relatively small.
> > > > However, when the number of localfs files is 1000, the 1000 files will be
> > > > loaded in parallel simultaneously, which will exhaust all system resources.
> > > > Loading from HDFS doesn't have the problem because the 1000 (or more) file
> > > > splits will be queued into each parallel loader.
> > > >
> > > > Best,
> > > > Yingyi
> > > >
> > > >
> > > > > On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <[email protected]> wrote:
> > > > >
> > > > > > You can also load multiple adm files into the same dataset with a
> > > > > > single AQL statement, as follows:
> > > > >
> > > > > > load dataset Tweets
> > > > > > using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
> > > > > > (("path"=
> > > > > > "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
> > > > > > 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
> > > > > > 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
> > > > > > 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
> > > > > > 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
> > > > > > 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
> > > > > > 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
> > > > > > 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
> > > > > > ("format"="adm"));
> > > > >
> > > > >
> > > > > The above AQL loads 8 adm files into a single dataset named Tweets.
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Young-Seok
> > > > >
> > > > > > On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Yingyi,
> > > > > > >
> > > > > > > Thanks for your reply. I think the external dataset with scan query
> > > > > > > is a good solution. I will try that. Thank you.
> > > > > >
> > > > > > Best,
> > > > > > Xikui
> > > > > >
> > > > > > > On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <[email protected]> wrote:
> > > > > > >
> > > > > > > > Xikui,
> > > > > > > >
> > > > > > > > If the number of localfs files is too large, a solution could be to
> > > > > > > > put your files on HDFS and then load them. Loading from HDFS always
> > > > > > > > has a fixed degree of parallelism regardless of the number of files.
> > > > > > > >
> > > > > > > > >> I am wondering, is there a way to append an adm file to an
> > > > > > > > >> existing dataset?
> > > > > > > > You can create an external dataset and then write an insert statement
> > > > > > > > where the body is a scan query. AsterixDB doesn't load any data into
> > > > > > > > its own storage for an external dataset but just keeps file paths.
> > > > > > > Here is a manual for external datasets:
> > > > > > > https://ci.apache.org/projects/asterixdb/aql/externaldata.html
> > > > > > >
> > > > > > > Best,
> > > > > > > Yingyi
> > > > > > >
> > > > > > >
> > > > > > > > On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I want to import data from multiple adm files into the same dataset.
> > > > > > > > > Merging them together and then loading from localfs can be a viable
> > > > > > > > > solution, but this may become a problem when the number becomes too
> > > > > > > > > large. I am wondering, is there a way to append an adm file to an
> > > > > > > > > existing dataset?
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Xikui
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
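
The external-dataset route Yingyi describes above could be sketched as
follows. The names `TweetsExternal` and `TweetType`, the host, and the file
path are hypothetical; the sketch assumes `TweetType` is the existing type of
the `Tweets` dataset and that both datasets live in the current dataverse. The
external data manual linked above is the place to check the supported adapters
and parameters.

```
// Hypothetical external dataset: AsterixDB only records the file path here,
// no data is copied into its own storage.
create external dataset TweetsExternal(TweetType)
using localfs
(("path"="127.0.0.1:///hypothetical/path/new-batch.adm"),
 ("format"="adm"));

// Append: scan the external dataset and insert each record into Tweets.
insert into dataset Tweets (
  for $t in dataset TweetsExternal
  return $t
);
```

Because the insert body is an ordinary scan query, the same pattern could be
repeated for each new batch of files without reloading the target dataset.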
