I'm confused: Why is "self-join" (or any join) an issue? I think the
alleged case (:-)) against self-join is equivalent to a case against
ever doing any queries against any data set under any circumstances
where data is being inserted.... I don't think we want to restrict the
system to only querying read-only datasets...
A feed lets you run a query against the system based on the contents of
a current incoming record R. Unless I am missing something (which is
not unlikely because it's been a long day and I just got home from
traveling :-)), this is equivalent to:
let $r := ... (picture a constant constructor that yields the same
content as R) ...
return feed_processor($r)
Right? I.e., the new record R is not yet in the dataset - so - what's
the issue? What's special about this?
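
Read that way, the per-record call can be modeled with a toy sketch. This is plain Python, not AsterixDB code: the `ingest` helper, the "same username" predicate, and storing the raw record after the call are all assumptions made for illustration. The point is only that the function sees the dataset as it stands before the incoming record is inserted.

```python
def feed_processor(record, dataset):
    """Return stored tweets that match the incoming record.

    The predicate (same username) is a hypothetical stand-in for
    "some predicate($x, $y)" in the original example.
    """
    return [t for t in dataset if t["username"] == record["username"]]


def ingest(feed, dataset):
    """Treat each incoming record as a constant, query, then store it."""
    results = []
    for record in feed:
        # The record is not yet in the dataset when the function runs.
        results.append(feed_processor(record, dataset))
        # Simplification: store the raw record after processing.
        dataset.append(record)
    return results


tweets = []
incoming = [
    {"id": "1", "username": "a"},
    {"id": "2", "username": "a"},
    {"id": "3", "username": "b"},
]
matches = ingest(incoming, tweets)
```

In this model the second record finds the first one, because that one was already stored, while no record ever finds itself.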
Cheers,
Mike
PS - Again, apologies if a long day has led to extra cluelessness on my
part...
On 12/8/15 9:52 PM, abdullah alamoudi wrote:
I think that we probably should restrict feed-applied functions somehow
(this needs further thought and discussion), and I know for sure that we
currently don't.
As for the case you present, I would imagine that it could be allowed
theoretically, but I think everyone sees why it should be disallowed.
One thing to keep in mind is that we introduce a materialize step if the
dataset is part of an insert pipeline. Now think about how this would work
with a continuous feed. One choice would be for the feed to materialize all
records to be inserted and, once the feed stops, start inserting them, but
I still think we should not allow it.
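
The materialization concern can be illustrated with a toy model. This is plain Python, not AsterixDB internals; the function names and the `limit` guard are assumptions made for illustration. It models a pipeline that scans a dataset and inserts into that same dataset: with a snapshot taken first the scan terminates, and without one the scan keeps seeing its own inserts.

```python
def insert_with_materialize(dataset, derive):
    """Scan fully into a snapshot first, then insert."""
    snapshot = list(dataset)  # materialize: no insert happens during the scan
    for row in snapshot:
        dataset.append(derive(row))


def insert_without_materialize(dataset, derive, limit):
    """Interleave scan and insert; `limit` only stops the runaway loop."""
    inserted = 0
    for row in dataset:  # a list iterator also visits rows appended mid-loop
        if inserted == limit:
            break
        dataset.append(derive(row))
        inserted += 1
    return inserted


data = [1, 2, 3]
insert_with_materialize(data, lambda x: x * 10)

runaway = [1, 2, 3]
count = insert_without_materialize(runaway, lambda x: x * 10, limit=100)
```

Here `data` ends with exactly three new rows, while the unguarded variant only stops because of `limit`. For a feed that never ends, the snapshot in the first variant would never be complete, which is the difficulty described above.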
My 2c,
Any opposing argument?
Amoudi, Abdullah.
On Tue, Dec 8, 2015 at 6:28 PM, Ildar Absalyamov <[email protected]> wrote:
Hi All,
As a part of feed ingestion we do allow preprocessing incoming data with
AQL UDFs.
I was wondering whether we restrict the kinds of UDFs that can be used.
Do we allow joins in these UDFs? In particular, do we allow joins with
the same dataset that is used for intake? For example:
create type TweetType as open {
    id: string,
    username: string,
    location: string,
    text: string,
    timestamp: string
}
create dataset Tweets(TweetType)
primary key id;
create function feed_processor($x) {
    for $y in dataset Tweets
    // self-join with Tweets dataset on some predicate($x, $y)
    return $y
}
create feed TweetFeed
apply function feed_processor;
The query above fails at runtime, but I was wondering whether it could
work at all, even in theory.
Best regards,
Ildar