Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
Hm LimitedPrivate is not the intention. Those APIs (e.g. data source) are
by no means private. They are just lower level APIs whose intended audience
is library developers, not end users.

On Thu, May 12, 2016 at 8:32 PM, Sean Busbey  wrote:

> We could switch to the Audience Annotation from Apache Yetus[1], and
> then rely on Public for end-users and LimitedPrivate for those things
> we intend as lower-level things with particular non-end-user
> audiences.
>
> [1]:
> http://yetus.apache.org/documentation/in-progress/#yetus-audience-annotations
>
> On Thu, May 12, 2016 at 3:35 PM, Reynold Xin  wrote:
> > That's true. I think I want to differentiate end-user vs developer.
> Public
> > isn't the best word. Maybe EndUser?
> >
> > On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkataraman
> >  wrote:
> >>
> >> On Thu, May 12, 2016 at 2:29 PM, Reynold Xin 
> wrote:
> >> > We currently have three levels of interface annotation:
> >> >
> >> > - unannotated: stable public API
> >> > - DeveloperApi: A lower-level, unstable API intended for developers.
> >> > - Experimental: An experimental user-facing API.
> >> >
> >> >
> >> > After using this annotation for ~ 2 years, I would like to propose the
> >> > following changes:
> >> >
> >> > 1. Require explicit annotation for public APIs. This reduces the
> >> > chance of
> >> > us accidentally exposing private APIs.
> >> >
> >> +1
> >>
> >> > 2. Separate interface annotation into two components: one that
> describes
> >> > intended audience, and the other that describes stability, similar to
> >> > what
> >> > Hadoop does. This allows us to define "low level" APIs that are
> stable,
> >> > e.g.
> >> > the data source API (I'd argue this is the API that should be more
> >> > stable
> >> > than end-user-facing APIs).
> >> >
> >> > InterfaceAudience: Public, Developer
> >> >
> >> > InterfaceStability: Stable, Experimental
> >> >
> >> I'm not very sure about this. What advantage do we get from Public vs.
> >> Developer? Also, somebody needs to make a judgement call on that, which
> >> might not always be easy to do.
> >> >
> >> > What do you think?
> >
> >
>
>
>
> --
> busbey
>


Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Sean Busbey
We could switch to the Audience Annotation from Apache Yetus[1], and
then rely on Public for end-users and LimitedPrivate for those things
we intend as lower-level things with particular non-end-user
audiences.

[1]: 
http://yetus.apache.org/documentation/in-progress/#yetus-audience-annotations

On Thu, May 12, 2016 at 3:35 PM, Reynold Xin  wrote:
> That's true. I think I want to differentiate end-user vs developer. Public
> isn't the best word. Maybe EndUser?
>
> On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkataraman
>  wrote:
>>
>> On Thu, May 12, 2016 at 2:29 PM, Reynold Xin  wrote:
>> > We currently have three levels of interface annotation:
>> >
>> > - unannotated: stable public API
>> > - DeveloperApi: A lower-level, unstable API intended for developers.
>> > - Experimental: An experimental user-facing API.
>> >
>> >
>> > After using this annotation for ~ 2 years, I would like to propose the
>> > following changes:
>> >
>> > 1. Require explicit annotation for public APIs. This reduces the
>> > chance of
>> > us accidentally exposing private APIs.
>> >
>> +1
>>
>> > 2. Separate interface annotation into two components: one that describes
>> > intended audience, and the other that describes stability, similar to
>> > what
>> > Hadoop does. This allows us to define "low level" APIs that are stable,
>> > e.g.
>> > the data source API (I'd argue this is the API that should be more
>> > stable
>> > than end-user-facing APIs).
>> >
>> > InterfaceAudience: Public, Developer
>> >
>> > InterfaceStability: Stable, Experimental
>> >
>> I'm not very sure about this. What advantage do we get from Public vs.
>> Developer? Also, somebody needs to make a judgement call on that, which
>> might not always be easy to do.
>> >
>> > What do you think?
>
>



-- 
busbey

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark uses disk instead of memory to store RDD blocks

2016-05-12 Thread Takeshi Yamamuro
If you ran a shuffle that eats a large amount of execution memory, it may
have evicted the cached RDD blocks because the shuffle ran short of memory.
Please see:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L32

// maropu
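The arithmetic behind that eviction can be sketched as follows. This is a back-of-envelope illustration only, assuming Spark 1.6's documented defaults (`spark.memory.fraction = 0.75`, `spark.memory.storageFraction = 0.5`, 300 MB reserved system memory per the `UnifiedMemoryManager` source linked above); the heap size is made up:

```java
public class UnifiedMemorySketch {
    // Reserved system memory in Spark 1.6's UnifiedMemoryManager.
    static final long RESERVED_SYSTEM_MEMORY = 300L * 1024 * 1024;

    /** Returns {unified memory, storage region} in bytes. */
    static long[] split(long heapBytes, double memoryFraction, double storageFraction) {
        long unified = (long) ((heapBytes - RESERVED_SYSTEM_MEMORY) * memoryFraction);
        long storageRegion = (long) (unified * storageFraction);
        return new long[] { unified, storageRegion };
    }

    public static void main(String[] args) {
        // Illustrative 40 GB executor heap with the assumed 1.6 defaults.
        long[] m = split(40L * 1024 * 1024 * 1024, 0.75, 0.5);
        System.out.printf("unified = %.1f GB, storage region = %.1f GB%n",
                m[0] / 1e9, m[1] / 1e9);
        // Execution may borrow beyond the storage region and evict cached
        // blocks; with MEMORY_AND_DISK_SER the evicted blocks go to disk,
        // which can explain disk-backed blocks even on a "free" executor.
    }
}
```

The key point is that "memory remaining" on the UI and "storage memory protected from eviction" are different quantities.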

On Fri, May 13, 2016 at 9:35 AM, Alexander Pivovarov 
wrote:

> Each executor on the screenshot has 25GB memory remaining. What was the
> reason to store 170-500 MB to disk if the executor has 25GB memory available?
>
> On Thu, May 12, 2016 at 5:12 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> Not sure this is the correct answer, but it seems `UnifiedMemoryManager`
>> spills some RDD blocks to disk when execution memory runs short.
>>
>> // maropu
>>
>> On Fri, May 13, 2016 at 6:16 AM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> I use Spark 1.6.0 on YARN  (EMR-4.3.0)
>>>
>>> I use MEMORY_AND_DISK_SER StorageLevel for my RDD. And I use Kryo
>>> Serializer
>>>
>>> I noticed that Spark uses Disk to store some RDD blocks even if
>>> Executors have lots of memory available. See the screenshot
>>> http://postimg.org/image/gxpsw1fk1/
>>>
>>> Any ideas why it might happen?
>>>
>>> Thank you
>>> Alex
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Spark uses disk instead of memory to store RDD blocks

2016-05-12 Thread Alexander Pivovarov
Each executor on the screenshot has 25GB memory remaining. What was the
reason to store 170-500 MB to disk if the executor has 25GB memory available?

On Thu, May 12, 2016 at 5:12 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Not sure this is the correct answer, but it seems `UnifiedMemoryManager`
> spills some RDD blocks to disk when execution memory runs short.
>
> // maropu
>
> On Fri, May 13, 2016 at 6:16 AM, Alexander Pivovarov  > wrote:
>
>> Hello Everyone
>>
>> I use Spark 1.6.0 on YARN  (EMR-4.3.0)
>>
>> I use MEMORY_AND_DISK_SER StorageLevel for my RDD. And I use Kryo
>> Serializer
>>
>> I noticed that Spark uses Disk to store some RDD blocks even if Executors
>> have lots of memory available. See the screenshot
>> http://postimg.org/image/gxpsw1fk1/
>>
>> Any ideas why it might happen?
>>
>> Thank you
>> Alex
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Spark uses disk instead of memory to store RDD blocks

2016-05-12 Thread Takeshi Yamamuro
Hi,

Not sure this is the correct answer, but it seems `UnifiedMemoryManager`
spills some RDD blocks to disk when execution memory runs short.

// maropu

On Fri, May 13, 2016 at 6:16 AM, Alexander Pivovarov 
wrote:

> Hello Everyone
>
> I use Spark 1.6.0 on YARN  (EMR-4.3.0)
>
> I use MEMORY_AND_DISK_SER StorageLevel for my RDD. And I use Kryo
> Serializer
>
> I noticed that Spark uses Disk to store some RDD blocks even if Executors
> have lots of memory available. See the screenshot
> http://postimg.org/image/gxpsw1fk1/
>
> Any ideas why it might happen?
>
> Thank you
> Alex
>



-- 
---
Takeshi Yamamuro


Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
That's true. I think I want to differentiate end-user vs developer. Public
isn't the best word. Maybe EndUser?

On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> On Thu, May 12, 2016 at 2:29 PM, Reynold Xin  wrote:
> > We currently have three levels of interface annotation:
> >
> > - unannotated: stable public API
> > - DeveloperApi: A lower-level, unstable API intended for developers.
> > - Experimental: An experimental user-facing API.
> >
> >
> > After using this annotation for ~ 2 years, I would like to propose the
> > following changes:
> >
> > 1. Require explicit annotation for public APIs. This reduces the
> chance of
> > us accidentally exposing private APIs.
> >
> +1
>
> > 2. Separate interface annotation into two components: one that describes
> > intended audience, and the other that describes stability, similar to
> what
> > Hadoop does. This allows us to define "low level" APIs that are stable,
> e.g.
> > the data source API (I'd argue this is the API that should be more stable
> > than end-user-facing APIs).
> >
> > InterfaceAudience: Public, Developer
> >
> > InterfaceStability: Stable, Experimental
> >
> I'm not very sure about this. What advantage do we get from Public vs.
> Developer? Also, somebody needs to make a judgement call on that, which
> might not always be easy to do.
> >
> > What do you think?
>


Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Shivaram Venkataraman
On Thu, May 12, 2016 at 2:29 PM, Reynold Xin  wrote:
> We currently have three levels of interface annotation:
>
> - unannotated: stable public API
> - DeveloperApi: A lower-level, unstable API intended for developers.
> - Experimental: An experimental user-facing API.
>
>
> After using this annotation for ~ 2 years, I would like to propose the
> following changes:
>
> 1. Require explicit annotation for public APIs. This reduces the chance of
> us accidentally exposing private APIs.
>
+1

> 2. Separate interface annotation into two components: one that describes
> intended audience, and the other that describes stability, similar to what
> Hadoop does. This allows us to define "low level" APIs that are stable, e.g.
> the data source API (I'd argue this is the API that should be more stable
> than end-user-facing APIs).
>
> InterfaceAudience: Public, Developer
>
> InterfaceStability: Stable, Experimental
>
I'm not very sure about this. What advantage do we get from Public vs.
Developer? Also, somebody needs to make a judgement call on that, which
might not always be easy to do.
>
> What do you think?




[discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
We currently have three levels of interface annotation:

- unannotated: stable public API
- DeveloperApi: A lower-level, unstable API intended for developers.
- Experimental: An experimental user-facing API.


After using this annotation for ~ 2 years, I would like to propose the
following changes:

1. Require explicit annotation for public APIs. This reduces the chance
of us accidentally exposing private APIs.

2. Separate interface annotation into two components: one that describes
intended audience, and the other that describes stability, similar to what
Hadoop does. This allows us to define "low level" APIs that are stable,
e.g. the data source API (I'd argue this is the API that should be more
stable than end-user-facing APIs).

InterfaceAudience: Public, Developer

InterfaceStability: Stable, Experimental


What do you think?
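As a hedged illustration of what the proposed two-component scheme could look like (names modeled on Hadoop's annotations; this is a sketch, not Spark's actual API at the time of this thread):

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class AnnotationSketch {

    /** Who the API is aimed at: end users vs. library developers. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface InterfaceAudience {
        String value(); // "Public" or "Developer"
    }

    /** Compatibility guarantee, orthogonal to audience. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface InterfaceStability {
        String value(); // "Stable" or "Experimental"
    }

    // The case the proposal calls out: a low-level API aimed at library
    // developers that is nevertheless stable, e.g. the data source API.
    @InterfaceAudience("Developer")
    @InterfaceStability("Stable")
    public interface DataSourceApi {
        String shortName();
    }

    public static void main(String[] args) {
        InterfaceAudience a = DataSourceApi.class.getAnnotation(InterfaceAudience.class);
        InterfaceStability s = DataSourceApi.class.getAnnotation(InterfaceStability.class);
        System.out.println(a.value() + " / " + s.value());
    }
}
```

Separating the two axes is what lets "Developer + Stable" exist at all; the old single-axis scheme conflated "for developers" with "unstable".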


Spark uses disk instead of memory to store RDD blocks

2016-05-12 Thread Alexander Pivovarov
Hello Everyone

I use Spark 1.6.0 on YARN  (EMR-4.3.0)

I use MEMORY_AND_DISK_SER StorageLevel for my RDD. And I use Kryo Serializer

I noticed that Spark uses Disk to store some RDD blocks even if Executors
have lots of memory available. See the screenshot
http://postimg.org/image/gxpsw1fk1/

Any ideas why it might happen?

Thank you
Alex


Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-12 Thread Brian Cho
Kay -- we would like to add the read metrics (in a compatible way) into our
internal DFS at Facebook and then call that method from Spark. In parallel,
if you can finish up HADOOP-11873 :), then we could add hooks to those
metrics in Spark. What do you think? Does this look like a feasible plan for
getting the metrics in?

Thanks,
Brian

On Thu, May 12, 2016 at 12:12 PM, Steve Loughran 
wrote:

>
> On 12 May 2016, at 04:44, Brian Cho  wrote:
>
> Hi Kay,
>
> Thank you for the detailed explanation.
>
> If I understand correctly, I *could* time each record's processing by
> measuring the time in reader.next, but this would add overhead for every
> single record. And this is the method that was abandoned because of
> performance regressions.
>
> The other possibility is changing HDFS first. This method looks promising
> even if it takes some time. I'll play around with it a bit for now. Thanks
> again!
>
> -Brian
>
> On Wed, May 11, 2016 at 4:45 PM, Kay Ousterhout 
> wrote:
>
>> Hi Brian,
>>
>> Unfortunately it's not possible to do this in Spark for two reasons.
>> First, we read records from Spark one at a time (e.g., if you're reading a
>> HDFS file and performing some map function, one record will be read from
>> HDFS, then the map function will be applied, then the next record will be
>> read, etc.). The relevant code is here
>> :
>> we create an iterator that's then passed on to other downstream RDDs.  As a
>> result, we'd need to time each record's processing, which adds too much
>> overhead.
>>
>> The other potential issue is that we use the RecordReader interface,
>> which means that we get deserialized and decompressed records, so any time
>> we measured would include time to read the data from disk and
>> decompress/deserialize it (not sure if you're trying to isolate the disk
>> time).
>>
>
> Measuring decompression overhead alone is interesting. Indeed, with
> encryption at rest and erasure coding in hadoop, you'd think about
> isolating work there too, to see where the bottlenecks move to after a
> switch to SSDs.
>
>
>> It *is* possible to do this instrumentation for disk read time in HDFS,
>> because HDFS reads larger blocks from disk (and then passes them to Spark
>> one by one), and I did that (in a hacky way) in the most recent commits
>> in this Hadoop branch.
>> I filed a Hadoop JIRA
>> to add this (in a
>> less hacky way, using FileSystem.Statistics) but haven't submitted a patch
>> for it.  If there's sufficient interest, I could properly implement the
>> metrics and see if it could be merged into Hadoop, at which point Spark
>> could start reading those metrics (unfortunately, the delay for this would
>> be pretty significant because we'd need to wait for a new Hadoop version
>> and then a new Spark version, and it would only be available in newer
>> versions of Hadoop).
>>
>
> The metrics API changed 19 hours ago into something more sophisticated,
> though it doesn't measure timings.
>
> https://issues.apache.org/jira/browse/HADOOP-13065
>
> it's designed to be more extensible; you'll ask for a metric by name, not
> compile-time field...this will let different filesystems add different
> values
>
> A few minutes ago, https://issues.apache.org/jira/browse/HADOOP-13028 went
> in to do some metric work for spark, and there the stats can be printed in
> logs, because the filesystem and inputStream toString() operators return
> the metrics. That's for people: not machines; the text may break without
> warning. But you can at least dump the metrics in your logs to see what's
> going on. That stuff can be seen in downstream tests, but not directly
> published as metrics. The aggregate stats are also collected as metrics2
> stats, which should somehow be convertible to Coda Hale metrics, and hence
> with the rest of Spark's monitoring.
>
>
> A more straightforward action might just be for spark itself to
> subclass FilterFileSystem and implement operation timing there, both for
> operations and any input/output streams returned in create & open.
>
>


Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-12 Thread Steve Loughran

On 12 May 2016, at 04:44, Brian Cho wrote:

Hi Kay,

Thank you for the detailed explanation.

If I understand correctly, I *could* time each record's processing by
measuring the time in reader.next, but this would add overhead for every single 
record. And this is the method that was abandoned because of performance 
regressions.

The other possibility is changing HDFS first. This method looks promising even 
if it takes some time. I'll play around with it a bit for now. Thanks again!

-Brian

On Wed, May 11, 2016 at 4:45 PM, Kay Ousterhout wrote:
Hi Brian,

Unfortunately it's not possible to do this in Spark for two reasons.  First, we 
read records from Spark one at a time (e.g., if you're reading a HDFS file and 
performing some map function, one record will be read from HDFS, then the map 
function will be applied, then the next record will be read, etc.). The 
relevant code is 
here:
 we create an iterator that's then passed on to other downstream RDDs.  As a 
result, we'd need to time each record's processing, which adds too much 
overhead.

The other potential issue is that we use the RecordReader interface, which 
means that we get deserialized and decompressed records, so any time we 
measured would include time to read the data from disk and 
decompress/deserialize it (not sure if you're trying to isolate the disk time).

Measuring decompression overhead alone is interesting. Indeed, with encryption 
at rest and erasure coding in hadoop, you'd think about isolating work there 
too, to see where the bottlenecks move to after a switch to SSDs.


It *is* possible to do this instrumentation for disk read time in HDFS, because 
HDFS reads larger blocks from disk (and then passes them to Spark one by one), 
and I did that (in a hacky way) in the most recent commits in this Hadoop 
branch.
I filed a Hadoop JIRA to
add this (in a less hacky way, using FileSystem.Statistics) but haven't 
submitted a patch for it.  If there's sufficient interest, I could properly 
implement the metrics and see if it could be merged into Hadoop, at which point 
Spark could start reading those metrics (unfortunately, the delay for this 
would be pretty significant because we'd need to wait for a new Hadoop version 
and then a new Spark version, and it would only be available in newer versions 
of Hadoop).

The metrics API changed 19 hours ago into something more sophisticated, though 
it doesn't measure timings.

https://issues.apache.org/jira/browse/HADOOP-13065

it's designed to be more extensible; you'll ask for a metric by name, not 
compile-time field...this will let different filesystems add different values

A few minutes ago, https://issues.apache.org/jira/browse/HADOOP-13028 went in 
to do some metric work for spark, and there the stats can be printed in logs, 
because the filesystem and inputStream toString() operators return the metrics. 
That's for people: not machines; the text may break without warning. But you 
can at least dump the metrics in your logs to see what's going on. That stuff 
can be seen in downstream tests, but not directly published as metrics. The 
aggregate stats are also collected as metrics2 stats, which should somehow be 
convertible to Coda Hale metrics, and hence with the rest of Spark's monitoring.


A more straightforward action might just be for spark itself to subclass 
FilterFileSystem and implement operation timing there, both for operations and 
any input/output streams returned in create & open.
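That FilterFileSystem idea can be sketched in plain java.io for illustration: wrap the stream and accumulate time spent inside read() calls, rather than timing every record inside Spark. A real version would wrap Hadoop's FSDataInputStream from a FilterFileSystem subclass; the class and field names here are assumptions, not an existing API:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

public class TimedInputStream extends FilterInputStream {
    // Total wall-clock time spent in read() calls, in nanoseconds.
    public final AtomicLong readNanos = new AtomicLong();

    public TimedInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        long t0 = System.nanoTime();
        try { return super.read(); }
        finally { readNanos.addAndGet(System.nanoTime() - t0); }
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        long t0 = System.nanoTime();
        try { return super.read(b, off, len); }
        finally { readNanos.addAndGet(System.nanoTime() - t0); }
    }

    public static void main(String[] args) throws IOException {
        TimedInputStream tin =
            new TimedInputStream(new ByteArrayInputStream(new byte[1024]));
        byte[] buf = new byte[256];
        while (tin.read(buf, 0, buf.length) != -1) { /* drain the stream */ }
        System.out.println("accumulated read nanos: " + tin.readNanos.get());
    }
}
```

Because the timing is per block-sized read() rather than per record, the overhead stays negligible, which was the objection to the abandoned per-record approach.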



Re: [build system] short downtime next thursday morning, 5-12-16 @ 8am PDT

2016-05-12 Thread shane knapp
ok, i've decided to roll back the upgrade and do this again early next
week.  some of the new features/security fixes break the pull request
builder, so i will need to revisit my plan.

sorry for the downtime -- we're back up and running now.

On Thu, May 12, 2016 at 8:41 AM, shane knapp  wrote:
> things are looking good -- i'm backing up the entire jenkins
> installation right now (just in case), so that's taking a while to
> finish.
>
> i'm doing the backup as LTS has finally surpassed the version we're
> on, so i'm taking this opportunity to move this installation to LTS.
>
> shane
>
> On Thu, May 12, 2016 at 8:00 AM, shane knapp  wrote:
>> this is happening now.
>>
>> On Wed, May 11, 2016 at 4:42 PM, shane knapp  wrote:
>>> reminder:  this is happening tomorrow morning!
>>>
>>> 7am PDT:  builds paused
>>> 8am PDT:  master reboot, upgrade happens
>>> 9am PDT:  builds restarted
>>>
>>> On Mon, May 9, 2016 at 4:17 PM, shane knapp  wrote:
 reminder:  this is happening thursday morning.

 On Wed, May 4, 2016 at 11:38 AM, shane knapp  wrote:
> there's a security update coming out for jenkins next week, and i'm
> going to install the update first thing thursday morning.
>
> i'll send out another reminder early next week.
>
> thanks!
>
> shane




Re: Cache Shuffle Based Operation Before Sort

2016-05-12 Thread Takeshi Yamamuro
Hi,

Looks interesting.
This optimisation also seems effective when simply loading and
sorting a DataFrame:
val df = sqlCtx.read.load(path)
df.cache.sort("some column")

How big an effect does this optimisation have on actual performance?
If it's big, it'd be better to open a JIRA.

// maropu

On Mon, May 9, 2016 at 2:21 PM, Ted Yu  wrote:

> I assume there were supposed to be images following this line (which I
> don't see in the email thread):
>
> bq. Let’s look at details of execution for 10 and 100 scale factor input
>
> Consider using a 3rd-party image site.
>
> On Sun, May 8, 2016 at 5:17 PM, Ali Tootoonchian  wrote:
>
>> Thanks for your comment.
>> Which image or chart are you pointing to?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Cache-Shuffle-Based-Operation-Before-Sort-tp17331p17438.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: [build system] short downtime next thursday morning, 5-12-16 @ 8am PDT

2016-05-12 Thread shane knapp
things are looking good -- i'm backing up the entire jenkins
installation right now (just in case), so that's taking a while to
finish.

i'm doing the backup as LTS has finally surpassed the version we're
on, so i'm taking this opportunity to move this installation to LTS.

shane

On Thu, May 12, 2016 at 8:00 AM, shane knapp  wrote:
> this is happening now.
>
> On Wed, May 11, 2016 at 4:42 PM, shane knapp  wrote:
>> reminder:  this is happening tomorrow morning!
>>
>> 7am PDT:  builds paused
>> 8am PDT:  master reboot, upgrade happens
>> 9am PDT:  builds restarted
>>
>> On Mon, May 9, 2016 at 4:17 PM, shane knapp  wrote:
>>> reminder:  this is happening thursday morning.
>>>
>>> On Wed, May 4, 2016 at 11:38 AM, shane knapp  wrote:
 there's a security update coming out for jenkins next week, and i'm
 going to install the update first thing thursday morning.

 i'll send out another reminder early next week.

 thanks!

 shane




Re: [build system] short downtime next thursday morning, 5-12-16 @ 8am PDT

2016-05-12 Thread shane knapp
this is happening now.

On Wed, May 11, 2016 at 4:42 PM, shane knapp  wrote:
> reminder:  this is happening tomorrow morning!
>
> 7am PDT:  builds paused
> 8am PDT:  master reboot, upgrade happens
> 9am PDT:  builds restarted
>
> On Mon, May 9, 2016 at 4:17 PM, shane knapp  wrote:
>> reminder:  this is happening thursday morning.
>>
>> On Wed, May 4, 2016 at 11:38 AM, shane knapp  wrote:
>>> there's a security update coming out for jenkins next week, and i'm
>>> going to install the update first thing thursday morning.
>>>
>>> i'll send out another reminder early next week.
>>>
>>> thanks!
>>>
>>> shane




How Spark SQL correctly connect hive metastore database with Spark 2.0 ?

2016-05-12 Thread james
Hi Spark guys,
I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master
code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an
issue: it always connects to a local Derby database and can't connect to my
existing Hive metastore database. Could you help me check what the root
cause is? What is the specific configuration for integrating with the Hive
metastore in Spark 2.0? BTW, this case is OK in Spark 1.6.

Build package command:
./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests

Key configurations in spark-defaults.conf:
spark.sql.hive.metastore.version=1.1.0
spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
spark.executor.extraClassPath=/etc/hive/conf
spark.driver.extraClassPath=/etc/hive/conf
spark.yarn.jars=local:/usr/lib/spark/jars/*

There is an existing Hive metastore database named "test_sparksql". I always
get the error "metastore.ObjectStore: Failed to get database test_sparksql,
returning NoSuchObjectException" after issuing 'use test_sparksql'. Please
see the steps below for details.
 
$ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name
hive.enable.spark.execution.engine does not exist
16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle)
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have
multiple JAR versions of the same plugin in the classpath. The URL
"file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered,
and you are trying to register an identical plugin located at URL
"file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle)
"org.datanucleus" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL
"file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered,
and you are trying to register an identical plugin located at URL
"file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle)
"org.datanucleus.api.jdo" is already registered. Ensure you dont have
multiple JAR versions of the same plugin in the classpath. The URL
"file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already
registered, and you are trying to register an identical plugin located at
URL "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
16/05/12 22:23:30 INFO DataNucleus.Persistence: Property
datanucleus.cache.level2 unknown - will be ignored
16/05/12 22:23:30 INFO DataNucleus.Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name
hive.enable.spark.execution.engine does not exist
16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin
classes with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/05/12 22:23:32 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
16/05/12 22:23:32 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only"
so does not have its own datastore table.
16/05/12 22:23:33 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
16/05/12 22:23:33 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only"
so does not have its own datastore table.
16/05/12 22:23:33 INFO metastore.MetaStoreDirectSql: Using direct SQL,
underlying DB is DERBY
16/05/12 22:23:33 INFO metastore.ObjectStore: Initialized ObjectStore
16/05/12 22:23:33 WARN metastore.ObjectStore: Version information not found
in metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 1.2.0
16/05/12 22:23:33 WARN metastore.ObjectStore: Failed to get database
default, returning NoSuchObjectException
16/05/12 22:23:34 INFO 
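A hedged first thing to check, assuming the CDH layout shown in the message (paths are from the message; the diagnosis is a guess, not a confirmed fix): the "underlying DB is DERBY" line above suggests spark-sql never picked up hive-site.xml, since Spark falls back to an embedded Derby metastore when no Hive configuration is visible on its classpath.

```shell
# Hypothetical check: make hive-site.xml visible in Spark's own conf
# directory, then re-run and look for the remote metastore URI (instead
# of Derby) in the startup log.
cp /etc/hive/conf/hive-site.xml "$SPARK_HOME/conf/"
"$SPARK_HOME"/bin/spark-sql --master yarn --deploy-mode client \
  -e 'show databases'
```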

Spark Exposing RDD as WebService ?

2016-05-12 Thread Senthil Kumar
Hi All, I have a requirement to process a huge file (75 GB).

  Here is the sample data: apparently an HDFS fsimage XML dump whose tags
were stripped by the mail archive, so only fragments survive -- such as an
inode with id 100 and name spark.conf, and a trailing id 99989796.

  Steps:
1) Load complete 
2) Load INodeDirectorySection
3) Iterate each INode and query InodeSection as well as
INodeDirectorySection to find the parents (up to the ROOT directory)


  Currently I have done this as below:
  1) Load Inodes into Redis
  2) Load InodeDirectorySection into Redis
  3) For each Inode, query Redis and compute the parents

   The number of Inodes is close to 200 million, so the job is not
completing within SLA. I have a max SLA of 2-2.5 hours for this operation.

   How do I use Spark here and expose an RDD as a service for my
requirement? Can this be done with other methodologies?

--Senthil
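The parent-resolution step itself can be sketched in plain Java for illustration (all inode ids and names below are made up). In Spark, the natural translation is to broadcast the INodeDirectorySection child-to-parent map and apply this walk in a map() over the inode RDD, instead of issuing per-record Redis lookups:

```java
import java.util.HashMap;
import java.util.Map;

public class InodePathSketch {
    /** Walk child -> parent pointers up to the root, building a full path. */
    static String pathOf(long inode, Map<Long, Long> parentOf,
                         Map<Long, String> nameOf) {
        StringBuilder path = new StringBuilder();
        Long cur = inode;
        while (cur != null) {            // stop once we pass the root
            path.insert(0, "/" + nameOf.get(cur));
            cur = parentOf.get(cur);     // null when cur is the root
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Map<Long, Long> parentOf = new HashMap<>();
        Map<Long, String> nameOf = new HashMap<>();
        nameOf.put(1L, "root");                        // hypothetical root
        nameOf.put(2L, "conf");       parentOf.put(2L, 1L);
        nameOf.put(3L, "spark.conf"); parentOf.put(3L, 2L);
        System.out.println(pathOf(3L, parentOf, nameOf)); // /root/conf/spark.conf
    }
}
```

With the lookup tables broadcast to every executor, the 200 million walks become a single embarrassingly parallel pass, which is where Spark should beat the per-inode Redis round-trips.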