Path to resource added with SQL: ADD FILE

2016-02-03 Thread Antonio Piccolboni
Sorry if this is more appropriate for the user list; I asked there on 12/17 and
got the silent treatment. I am writing a UDF that needs some additional
info to perform its task. That information is in a file that I reference in
a SQL ADD FILE statement. I expect the file to be accessible in the working
directory of the UDF, but it doesn't seem to be: open("./my resource file")
fails. What is the correct way to access the added resource? Thanks


Antonio


Re: Path to resource added with SQL: ADD FILE

2016-02-04 Thread Antonio Piccolboni
Hi Herman,
thanks for your reply. I used an absolute path to add the file; I use a
relative path only to access it in the UDF. I am not sure what absolute
path I should use in the UDF, if any. The documentation I am referring to
is Hive's CLI manual, in the Hive Resources section
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources>;
I hope it's the relevant bit. There I read:

add FILE /tmp/tt.py;

from networks a MAP a.networkid USING 'python tt.py' as nn
where a.ds = '2009-01-04' limit 10;

So the example just refers to the file by basename, which I think is
equivalent to "./tt.py". I tried to follow this example closely, just
using R instead of Python, with the name of the file hard-coded rather
than passed as an argument. If what you mean is that I should provide
the same absolute path in the ADD FILE and in the SELECT, then I will
try that. Thanks


Antonio


On Thu, Feb 4, 2016 at 4:21 AM Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hi Antonio,
>
> I am not sure you got the silent treatment on the user list. Stackoverflow
> is also a good place to ask questions.
>
> Could you use an absolute path to add the file? So instead of './my
> resource file' (which is a relative path and depends on where you
> started Spark), use something like '/some/path/my resource file', or
> use a URI.
>
> Kind regards,
>
> Herman van Hövell
>
>
> 2016-02-03 19:17 GMT+01:00 Antonio Piccolboni :
>
>> Sorry if this is more appropriate for user list, I asked there on 12/17
>> and got the silence treatment. I am writing a UDF that needs some
>> additional info to perform its task. This information is in a file that I
>> reference in a SQL ADD FILE statement. I expect that file to be accessible
>> in the working directory for the UDF, but it doesn't seem to work (aka,
>> failure on open("./my resource file"). What is the correct way to access
>> the added resource? Thanks
>>
>>
>> Antonio
>>
>
>
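A minimal sketch of the approach discussed above, in Scala, assuming that ADD
FILE in Spark SQL registers the file through SparkContext.addFile, so the
local copy can be resolved with SparkFiles.get on the basename rather than by
opening a path relative to the task's working directory; the file name
my_resource.txt, the UDF and the tiny DataFrame are illustrative, not taken
from the thread:

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import scala.io.Source

object AddFileSketch {
  // Resolved lazily on each executor: SparkFiles.get returns the absolute
  // local path of a file registered with ADD FILE / SparkContext.addFile.
  lazy val resourceLines: Seq[String] =
    Source.fromFile(SparkFiles.get("my_resource.txt")).getLines().toVector

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("add-file-sketch")
      .master("local[*]") // for a local test run
      .getOrCreate()
    import spark.implicits._

    // Register the resource with an absolute path (or a URI), as suggested above.
    spark.sql("ADD FILE /tmp/my_resource.txt")

    // The UDF refers to the file by basename through SparkFiles.get,
    // not by opening "./my_resource.txt" in the working directory.
    val countMatches = udf((key: String) => resourceLines.count(_.contains(key)))

    Seq("alpha", "beta").toDF("k")
      .select($"k", countMatches($"k").as("matches"))
      .show()
  }
}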


Re: groupByKey() and keys with many values

2015-09-07 Thread Antonio Piccolboni
To expand on what Sean said, I would look into replacing groupByKey with
reduceByKey; also take a look at this doc.
I happen to have designed a library that was subject to the same criticism,
compared to the Java MapReduce API, with respect to its use of iterables; but
neither we nor the critics could ever find a natural example of a problem
that can be expressed as a single pass through each group using a constant
amount of memory and yet cannot be converted to use a combiner (MapReduce
jargon; called a reduce in Spark and in most functional circles). If you have
found such an example then, while an obstacle for you, it would be of some
interest to know what it is.
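A minimal sketch of the substitution being suggested, in Scala, with a toy
word-count-style pair RDD standing in for the real data (the types and values
are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduce-vs-group").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 3L), ("a", 2L), ("b", 4L)))

    // groupByKey materializes every value of a key in memory before the sum runs.
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey applies the (associative, commutative) function as a map-side
    // combiner, so only one running total per key is ever held in memory.
    val viaReduce = pairs.reduceByKey(_ + _)

    println(viaGroup.collect().toMap)  // Map(a -> 3, b -> 7)
    println(viaReduce.collect().toMap) // Map(a -> 3, b -> 7)
    sc.stop()
  }
}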


On Mon, Sep 7, 2015 at 1:31 AM Sean Owen  wrote:

> That's how it's intended to work; if it's a problem, you probably need
> to re-design your computation to not use groupByKey. Usually you can
> do so.
>
> On Mon, Sep 7, 2015 at 9:02 AM, kaklakariada 
> wrote:
> > Hi,
> >
> > I already posted this question on the users mailing list
> > (
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-groupByKey-with-many-values-per-key-td24538.html
> )
> > but did not get a reply. Maybe this is the correct forum to ask.
> >
> > My problem is, that doing groupByKey().mapToPair() loads all values for a
> > key into memory which is a problem when the values don't fit into memory.
> > This was not a problem with Hadoop map/reduce, as the Iterable passed to
> the
> > reducer read from disk.
> >
> > In Spark, the Iterable passed to mapToPair() is backed by a CompactBuffer
> > containing all values.
> >
> > Is it possible to change this behavior without modifying Spark, or is
> there
> > a plan to change this?
> >
> > Thank you very much for your help!
> > Christoph.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/groupByKey-and-keys-with-many-values-tp13985.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: groupByKey() and keys with many values

2015-09-08 Thread Antonio Piccolboni
You may also consider selecting the distinct keys and fetching from the
database first, then joining with the values on key. This is for the case
where Sean's approach is not viable, that is, when you need to have the DB
data before the first reduce call. By not revealing your problem you are
forcing us to make guesses, which are less useful. Imagine you want to
compute a binning of the values on a per-key basis, with the bin definitions
stored in the database; the reduce would then be updating counts per bin. You
could let the reduce initialize the bin counts from the DB when they are
empty, but this results in multiple database accesses and connections per
key, and the higher the degree of parallelism, the bigger the cost (see this
elementary example), which is something you should avoid if you want to write
code with some durability to it. If you use the join approach instead, you
select the keys, deduplicate them and perform the database access to obtain
the bin definitions; then you join the data with the bin definitions on key
and pass the result through a reduceByKey to update the bin counts. In a
different application, you might want to compute max and min values per key,
compare them with the previously recorded max and min, and store the overall
max and min; then you don't need the database values during the reduce at
all, and you can just fetch them in foreachPartition, before each write.
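A minimal sketch of the join-based binning approach just described, in Scala;
fetchBinDefs is a hypothetical stand-in for the per-key database lookup, and
the key and value types are illustrative:

import org.apache.spark.rdd.RDD

object BinningSketch {
  // Hypothetical stand-in for the real database lookup: bin boundaries for one key.
  def fetchBinDefs(key: String): Array[Double] = Array(0.0, 10.0, 100.0)

  def binCounts(data: RDD[(String, Double)]): RDD[(String, Map[Int, Long])] = {
    // 1. Select the distinct keys and fetch their bin definitions once per key.
    val binDefs = data.keys.distinct().map(k => (k, fetchBinDefs(k)))

    // 2. Join the values with the bin definitions on key.
    val joined = data.join(binDefs)

    // 3. A combiner (reduceByKey) updates the per-key bin counts incrementally,
    //    so no key's values ever need to be held in memory at once.
    joined
      .map { case (key, (value, bins)) =>
        val bin = bins.lastIndexWhere(_ <= value) // -1 means below the first boundary
        (key, Map(bin -> 1L))
      }
      .reduceByKey { (a, b) =>
        (a.keySet ++ b.keySet)
          .map(i => i -> (a.getOrElse(i, 0L) + b.getOrElse(i, 0L)))
          .toMap
      }
  }
}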

As far as the DB writes go, remember that Spark can retry a computation, so
your writes have to be idempotent (see this thread, in which Reynold is a bit
more optimistic about failures than I am comfortable with, but who am I to
question Reynold?).
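A minimal sketch of an idempotent write done once per partition, in Scala; the
JDBC URL, table and column names are made up, and the upsert uses PostgreSQL's
ON CONFLICT syntax as one way to make a retried write harmless:

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

object IdempotentWriteSketch {
  val jdbcUrl = "jdbc:postgresql://dbhost/stats" // made-up connection string

  def writeCounts(counts: RDD[(String, Long)]): Unit = {
    counts.foreachPartition { partition =>
      // One connection per partition, not one per record (Sean's foreachPartition point).
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        // An upsert keyed on `key` is idempotent: if Spark retries the partition,
        // the second attempt simply overwrites the first with the same value.
        val stmt = conn.prepareStatement(
          "INSERT INTO key_counts (key, cnt) VALUES (?, ?) " +
          "ON CONFLICT (key) DO UPDATE SET cnt = EXCLUDED.cnt")
        try {
          partition.foreach { case (key, cnt) =>
            stmt.setString(1, key)
            stmt.setLong(2, cnt)
            stmt.executeUpdate()
          }
        } finally {
          stmt.close()
        }
      } finally {
        conn.close()
      }
    }
  }
}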






On Tue, Sep 8, 2015 at 12:53 AM Sean Owen  wrote:

> I think groupByKey is intended for cases where you do want the values
> in memory; for one-pass use cases, it's more efficient to use
> reduceByKey, or aggregateByKey if lower-level operations are needed.
>
> For your case, you probably want to do your reduceByKey, then perform
> the expensive per-key lookups once per key. You also probably want to
> do this in foreachPartition, not foreach, in order to pay DB
> connection costs just once per partition.
>
> On Tue, Sep 8, 2015 at 7:20 AM, kaklakariada 
> wrote:
> > Hi Antonio!
> >
> > Thank you very much for your answer!
> > You are right in that in my case the computation could be replaced by a
> > reduceByKey. The thing is that my computation also involves database
> > queries:
> >
> > 1. Fetch key-specific data from database into memory. This is expensive
> and
> > I only want to do this once for a key.
> > 2. Process each value using this data and update the common data
> > 3. Store modified data to database. Here it is important to write all
> data
> > for a key in one go.
> >
> > Is there a pattern how to implement something like this with reduceByKey?
> >
> > Out of curiosity: I understand why you want to discourage people from
> using
> > groupByKey. But is there a technical reason why the Iterable is
> implemented
> > the way it is?
> >
> > Kind regards,
> > Christoph.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/groupByKey-and-keys-with-many-values-tp13985p13992.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
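And a minimal sketch of the aggregateByKey variant Sean mentions, applied to
the max/min-per-key example from earlier in the thread (the types are
illustrative); like reduceByKey, it keeps only one accumulator per key in
memory instead of the key's full list of values:

import org.apache.spark.rdd.RDD

object AggregateSketch {
  // Track (max, min) per key without ever materializing a key's full value list.
  def maxMinPerKey(data: RDD[(String, Double)]): RDD[(String, (Double, Double))] =
    data.aggregateByKey((Double.MinValue, Double.MaxValue))(
      // fold one value into a partial (max, min)
      { case ((mx, mn), v) => (math.max(mx, v), math.min(mn, v)) },
      // merge two partials coming from different partitions
      { case ((mx1, mn1), (mx2, mn2)) => (math.max(mx1, mx2), math.min(mn1, mn2)) }
    )
}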