Re: Open Issues for Contributors

2015-09-22 Thread Luciano Resende
You can use Jira filters to narrow down the scope of issues you want to
possibly address. For instance, I use this filter to look at open issues
that are unassigned:

https://issues.apache.org/jira/issues/?filter=12333428

For a specific release, you can also filter by release; I think Reynold had
sent this one a few days ago for 1.5.1:

https://issues.apache.org/jira/issues/?filter=1221


On Tue, Sep 22, 2015 at 8:50 AM, Pedro Rodriguez 
wrote:

> Where is the best place to look at open issues that haven't been
> assigned/started for the next release? I am interested in working on
> something, but I don't know what issues are higher priority for the next
> release.
>
> On a similar note, is there somewhere which outlines the overall goals for
> the next release (be it 1.5.1 or 1.6) with some parent issues along with
> smaller child issues to work on (like the built ins ticket from 1.5)?
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Why there is no snapshots for 1.5 branch?

2015-09-22 Thread Patrick Wendell
I just added snapshot builds for 1.5. They will take a few hours to
build, but once we get them working should publish every few hours.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging

- Patrick

On Mon, Sep 21, 2015 at 10:36 PM, Bin Wang  wrote:
> However, I found some scripts in dev/audit-release; can I use them?
>
> Bin Wang wrote on Tue, Sep 22, 2015 at 1:34 PM:
>>
>> No, I mean pushing Spark to my private repository. Spark doesn't have a
>> build.sbt as far as I can see.
>>
>> Fengdong Yu wrote on Tue, Sep 22, 2015 at 1:29 PM:
>>>
>>> Do you mean you want to publish the artifact to your private repository?
>>>
>>> If so, please use 'sbt publish'.
>>>
>>> Add the following to your build.sbt:
>>>
>>> publishTo := {
>>>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>>>   if (version.value.endsWith("SNAPSHOT"))
>>>     Some("snapshots" at nexus + "content/repositories/snapshots")
>>>   else
>>>     Some("releases" at nexus + "content/repositories/releases")
>>> }
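
(A side note for anyone following this recipe: publishing to a password-protected Nexus also needs sbt credentials. A minimal sketch, assuming the default Nexus realm; the host, user, and password below are placeholders, not values from this thread:)

  // hypothetical values -- adjust realm/host/user/password to your Nexus setup
  credentials += Credentials(
    "Sonatype Nexus Repository Manager",  // realm the server reports
    "YOUR_PRIVATE_REPO_HOSTS",            // host name only, no protocol
    "deploy-user",
    "deploy-password")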
>>>
>>>
>>>
>>> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
>>>
>>> My project is using sbt (or maven), which needs to download dependencies
>>> from a Maven repo. I have my own private Maven repo with Nexus, but I don't
>>> know how to push my own build to it. Can you give me a hint?
>>>
>>> Mark Hamstra wrote on Tue, Sep 22, 2015 at 1:25 PM:

 Yeah, whoever is maintaining the scripts and snapshot builds has fallen
 down on the job -- but there is nothing preventing you from checking out
 branch-1.5 and creating your own build, which is arguably a smarter thing to
 do anyway.  If I'm going to use a non-release build, then I want the full
 git commit history of exactly what is in that build readily available, not
 just somewhat arbitrary JARs.

 On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:
>
> But I cannot find 1.5.1-SNAPSHOT either at
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>
> Mark Hamstra wrote on Tue, Sep 22, 2015 at 12:55 PM:
>>
>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
>> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 
>> release
>> candidates and then the 1.5.1 release.
>>
>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>>>
>>> I'd like to use some important bug fixes in the 1.5 branch, and I looked for
>>> the Apache Maven host, but don't find any snapshot for the 1.5 branch.
>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>
>>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for
>>> 1.5.X?
>>
>>

>>>
>




Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-22 Thread shane knapp
ok, here's the updated downtime schedule for this week:

wednesday, sept 23rd:

firewall maintenance cancelled, as jon took care of the update
saturday morning while we were bringing jenkins back up after the colo
fire

thursday, sept 24th:

jenkins maintenance is still scheduled, but abbreviated as some of the
maintenance was performed saturday morning as well
* new builds will stop being accepted ~630am PDT
  - i'll kill any hangers-on at 730am, and after maintenance is done,
i will retrigger any killed jobs
* jenkins worker system package updates
  - amp-jenkins-master was completed on saturday
  - this will NOT include kernel updates as moving to
2.6.32-573.3.1.el6 bricked amp-jenkins-master
* moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79
* all systems get a reboot
* expected downtime:  3.5 hours or so

i'll post updates as i progress.

also, i'll post a copy of our post-mortem once the dust settles.  it's
been, shall we say, a pretty crazy few days.

http://news.berkeley.edu/2015/09/19/campus-network-outage/

:)

On Mon, Sep 21, 2015 at 10:11 AM, shane knapp  wrote:
> quick update:  we actually did some of the maintenance on our systems
> after the berkeley-wide outage caused by one of our (non-jenkins)
> servers halting and catching fire.
>
> we'll still have some downtime early wednesday, but tomorrow's will be
> cancelled.  i'll send out another update real soon now with what we'll
> be covering on wednesday once we get our current situation more under
> control.  :)
>
> On Wed, Sep 16, 2015 at 12:15 PM, shane knapp  wrote:
>>> 630am-10am thursday, 9-24-15:
>>> * jenkins update to 1.629 (we're a few months behind in versions, and
>>> some big bugs have been fixed)
>>> * jenkins master and worker system package updates
>>> * all systems get a reboot (lots of hanging java processes have been
>>> building up over the months)
>>> * builds will stop being accepted ~630am, and i'll kill any hangers-on
>>> at 730am, and retrigger once we're done
>>> * expected downtime:  3.5 hours or so
>>> * i will also be testing out some of my shiny new ansible playbooks
>>> for the system updates!
>>>
>> i forgot one thing:
>>
>> * moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79




Re: Open Issues for Contributors

2015-09-22 Thread Pedro Rodriguez
Thanks for the links (first one is broken or private).

I think the main mistake I was making was looking at the fix version instead
of the target version (the JIRA homepage's listing of versions links to fix
versions).

For anyone else interested in MLlib things, I am looking at this to see
what goals are: https://issues.apache.org/jira/browse/SPARK-10324

On Tue, Sep 22, 2015 at 11:10 AM, Luciano Resende 
wrote:

> You can use Jira filters to narrow down the scope of issues you want to
> possibly address. For instance, I use this filter to look at open issues
> that are unassigned:
>
> https://issues.apache.org/jira/issues/?filter=12333428
>
> For a specific release, you can also filter by release; I think Reynold had
> sent this one a few days ago for 1.5.1:
>
> https://issues.apache.org/jira/issues/?filter=1221
>
>
> On Tue, Sep 22, 2015 at 8:50 AM, Pedro Rodriguez 
> wrote:
>
>> Where is the best place to look at open issues that haven't been
>> assigned/started for the next release? I am interested in working on
>> something, but I don't know what issues are higher priority for the next
>> release.
>>
>> On a similar note, is there somewhere which outlines the overall goals
>> for the next release (be it 1.5.1 or 1.6) with some parent issues along
>> with smaller child issues to work on (like the built ins ticket from 1.5)?
>>
>> Thanks,
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


column identifiers in Spark SQL

2015-09-22 Thread Richard Hillegas


I am puzzled by the behavior of column identifiers in Spark SQL. I don't
find any guidance in the "Spark SQL and DataFrame Guide" at
http://spark.apache.org/docs/latest/sql-programming-guide.html. I am seeing
odd behavior related to case-sensitivity and to delimited (quoted)
identifiers.

Consider the following declaration of a table in the Derby relational
database, whose dialect hews closely to the SQL Standard:

   create table app.t( a int, "b" int, "c""d" int );

Now let's load that table into Spark like this:

  import org.apache.spark.sql._
  import org.apache.spark.sql.types._

  val df = sqlContext.read.format("jdbc").options(
    Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
        "dbtable" -> "app.t")).load()
  df.registerTempTable("test_data")

The following query runs fine because the column name matches the
normalized form in which it is stored in the metadata catalogs of the
relational database:

  // normalized column names are recognized
  sqlContext.sql(s"""select A from test_data""").show

But the following query fails during name resolution. This puzzles me
because non-delimited identifiers are case-insensitive in the ANSI/ISO
Standard. They are also supposed to be case-insensitive in HiveQL, at least
according to section 2.3.1 of the QuotedIdentifier.html webpage attached to
https://issues.apache.org/jira/browse/HIVE-6013:

  // ...unnormalized column names raise this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
columns A, b, c"d;
  sqlContext.sql("""select a from test_data""").show

Delimited (quoted) identifiers are treated as string literals. Again,
non-Standard behavior:

  // this returns rows consisting of the string literal "b"
  sqlContext.sql("""select "b" from test_data""").show

Embedded quotes in delimited identifiers won't even parse:

  // embedded quotes raise this error: java.lang.RuntimeException: [1.11]
failure: ``union'' expected but "d" found
  sqlContext.sql("""select "c""d" from test_data""").show

This behavior is non-Standard and it strikes me as hard to describe to
users concisely. Would the community support an effort to bring the
handling of column identifiers into closer conformance with the Standard?
Would backward compatibility concerns even allow us to do that?

Thanks,
-Rick

Re: column identifiers in Spark SQL

2015-09-22 Thread Michael Armbrust
Are you using a SQLContext or a HiveContext?  The programming guide
suggests the latter, as the former is really only there because some
applications may have conflicts with Hive dependencies.  SQLContext is case
sensitive by default whereas the HiveContext is not.  The parser in
HiveContext is also a lot better.
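
For illustration, a minimal 1.5 spark-shell sketch contrasting the two contexts (not part of the original reply; spark.sql.caseSensitive is the conf key that governs SQLContext's identifier resolution):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.hive.HiveContext

  val plainCtx = new SQLContext(sc)   // case-sensitive identifier resolution by default
  val hiveCtx  = new HiveContext(sc)  // case-insensitive, and uses the richer HiveQL parser

  // SQLContext can be switched to case-insensitive resolution via a conf setting:
  plainCtx.setConf("spark.sql.caseSensitive", "false")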

On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas 
wrote:

> I am puzzled by the behavior of column identifiers in Spark SQL. I don't
> find any guidance in the "Spark SQL and DataFrame Guide" at
> http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> seeing odd behavior related to case-sensitivity and to delimited (quoted)
> identifiers.
>
> Consider the following declaration of a table in the Derby relational
> database, whose dialect hews closely to the SQL Standard:
>
>create table app.t( a int, "b" int, "c""d" int );
>
> Now let's load that table into Spark like this:
>
>   import org.apache.spark.sql._
>   import org.apache.spark.sql.types._
>
>   val df = sqlContext.read.format("jdbc").options(
> Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
> "dbtable" -> "app.t")).load()
>   df.registerTempTable("test_data")
>
> The following query runs fine because the column name matches the
> normalized form in which it is stored in the metadata catalogs of the
> relational database:
>
>   // normalized column names are recognized
>   sqlContext.sql(s"""select A from test_data""").show
>
> But the following query fails during name resolution. This puzzles me
> because non-delimited identifiers are case-insensitive in the ANSI/ISO
> Standard. They are also supposed to be case-insensitive in HiveQL, at least
> according to section 2.3.1 of the QuotedIdentifier.html webpage attached to
> https://issues.apache.org/jira/browse/HIVE-6013:
>
>   // ...unnormalized column names raise this error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input
> columns A, b, c"d;
>   sqlContext.sql("""select a from test_data""").show
>
> Delimited (quoted) identifiers are treated as string literals. Again,
> non-Standard behavior:
>
>   // this returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> Embedded quotes in delimited identifiers won't even parse:
>
>   // embedded quotes raise this error: java.lang.RuntimeException: [1.11]
> failure: ``union'' expected but "d" found
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> This behavior is non-Standard and it strikes me as hard to describe to
> users concisely. Would the community support an effort to bring the
> handling of column identifiers into closer conformance with the Standard?
> Would backward compatibility concerns even allow us to do that?
>
> Thanks,
> -Rick
>


Re: column identifiers in Spark SQL

2015-09-22 Thread Michael Armbrust
HiveQL uses `backticks` for quoted identifiers.

On Tue, Sep 22, 2015 at 1:06 PM, Richard Hillegas 
wrote:

> Thanks for that tip, Michael. I think that my sqlContext was a raw
> SQLContext originally. I have rebuilt Spark like so...
>
>   sbt/sbt -Phive assembly/assembly
>
> Now I see that my sqlContext is a HiveContext. That fixes one of the
> queries. Now unnormalized column names work:
>
>   // ...unnormalized column names work now
>   sqlContext.sql("""select a from test_data""").show
>
> However, quoted identifiers are still treated as string literals:
>
>   // this still returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> And embedded quotes inside quoted identifiers are swallowed up:
>
>   // this now returns rows consisting of the string literal "cd"
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> Thanks,
> -Rick
>
> Michael Armbrust  wrote on 09/22/2015 10:58:36 AM:
>
> > From: Michael Armbrust 
> > To: Richard Hillegas/San Francisco/IBM@IBMUS
> > Cc: Dev 
> > Date: 09/22/2015 10:59 AM
> > Subject: Re: column identifiers in Spark SQL
>
> >
> > Are you using a SQLContext or a HiveContext?  The programming guide
> > suggests the latter, as the former is really only there because some
> > applications may have conflicts with Hive dependencies.  SQLContext
> > is case sensitive by default whereas the HiveContext is not.  The
> > parser in HiveContext is also a lot better.
> >
> > On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas 
> wrote:
> > I am puzzled by the behavior of column identifiers in Spark SQL. I
> > don't find any guidance in the "Spark SQL and DataFrame Guide" at
> > http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> > seeing odd behavior related to case-sensitivity and to delimited
> > (quoted) identifiers.
> >
> > Consider the following declaration of a table in the Derby
> > relational database, whose dialect hews closely to the SQL Standard:
> >
> >create table app.t( a int, "b" int, "c""d" int );
> >
> > Now let's load that table into Spark like this:
> >
> >   import org.apache.spark.sql._
> >   import org.apache.spark.sql.types._
> >
> >   val df = sqlContext.read.format("jdbc").options(
> > Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
> > "dbtable" -> "app.t")).load()
> >   df.registerTempTable("test_data")
> >
> > The following query runs fine because the column name matches the
> > normalized form in which it is stored in the metadata catalogs of
> > the relational database:
> >
> >   // normalized column names are recognized
> >   sqlContext.sql(s"""select A from test_data""").show
> >
> > But the following query fails during name resolution. This puzzles
> > me because non-delimited identifiers are case-insensitive in the
> > ANSI/ISO Standard. They are also supposed to be case-insensitive in
> > HiveQL, at least according to section 2.3.1 of the
> > QuotedIdentifier.html webpage attached to https://issues.apache.org/
> > jira/browse/HIVE-6013:
> >
> >   // ...unnormalized column names raise this error:
> > org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
> > input columns A, b, c"d;
> >   sqlContext.sql("""select a from test_data""").show
> >
> > Delimited (quoted) identifiers are treated as string literals.
> > Again, non-Standard behavior:
> >
> >   // this returns rows consisting of the string literal "b"
> >   sqlContext.sql("""select "b" from test_data""").show
> >
> > Embedded quotes in delimited identifiers won't even parse:
> >
> >   // embedded quotes raise this error: java.lang.RuntimeException:
> > [1.11] failure: ``union'' expected but "d" found
> >   sqlContext.sql("""select "c""d" from test_data""").show
> >
> > This behavior is non-Standard and it strikes me as hard to describe
> > to users concisely. Would the community support an effort to bring
> > the handling of column identifiers into closer conformance with the
> > Standard? Would backward compatibility concerns even allow us to do that?
> >
> > Thanks,
> > -Rick
>
>


Re: SparkR package path

2015-09-22 Thread Shivaram Venkataraman
As Rui says it would be good to understand the use case we want to
support (supporting CRAN installs could be one for example). I don't
think it should be very hard to do as the RBackend itself doesn't use
the R source files. The RRDD does use it and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- So we could introduce a new config flag that can be used for
this new mode.
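
A minimal sketch of what such a flag could look like on the JVM side (the key name spark.r.packageDir below is purely hypothetical, not an existing setting):

  import org.apache.spark.SparkConf

  // Prefer an explicit flag; fall back to the SPARK_HOME/R/lib layout RUtils uses today.
  def sparkRPackageDir(conf: SparkConf): String =
    conf.getOption("spark.r.packageDir")
      .orElse(sys.env.get("SPARK_HOME").map(_ + "/R/lib"))
      .getOrElse(throw new IllegalStateException("SPARK_HOME is not set"))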

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
> Hossein,
>
>
>
> Any strong reason to download and install the SparkR source package separately
> from the Spark distribution?
>
> An R user can simply download the Spark distribution, which contains the SparkR
> source and binary package, and directly use sparkR. No need to install the
> SparkR package at all.
>
>
>
> From: Hossein [mailto:fal...@gmail.com]
> Sent: Tuesday, September 22, 2015 9:19 AM
> To: dev@spark.apache.org
> Subject: SparkR package path
>
>
>
> Hi dev list,
>
>
>
> The SparkR backend assumes SparkR source files are located under
> "SPARK_HOME/R/lib/". This directory is created by running R/install-dev.sh.
> This setting makes sense for Spark developers, but if an R user downloads
> and installs the SparkR source package, the source files are going to be
> placed in a different location.
>
>
>
> In the R runtime it is easy to find the location of package files using
> path.package("SparkR"). But we need to make some changes to the R backend and/or
> spark-submit so that the JVM process learns the location of worker.R,
> daemon.R, and shell.R from the R runtime.
>
>
>
> Do you think this change is feasible?
>
>
>
> Thanks,
>
> --Hossein




Re: Derby version in Spark

2015-09-22 Thread Ted Yu
Which Spark release are you building ?

For master branch, I get the following:

lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
 lib_managed/jars/datanucleus-core-3.2.10.jar
 lib_managed/jars/datanucleus-rdbms-3.2.9.jar

FYI

On Tue, Sep 22, 2015 at 1:28 PM, Richard Hillegas 
wrote:

> I see that lib_managed/jars holds these old Derby versions:
>
>   lib_managed/jars/derby-10.10.1.1.jar
>   lib_managed/jars/derby-10.10.2.0.jar
>
> The Derby 10.10 release family supports some ancient JVMs: Java SE 5 and
> Java ME CDC/Foundation Profile 1.1. It's hard to imagine anyone running
> Spark on the resource-constrained Java ME platform. Is Spark really
> deployed on Java SE 5? Is there some other reason that Spark uses the 10.10
> Derby family?
>
> If no-one needs those ancient JVMs, maybe we could consider changing the
> Derby version to 10.11.1.1 or even to the upcoming 10.12.1.1 release (both
> run on Java 6 and up).
>
> Thanks,
> -Rick
>


Derby version in Spark

2015-09-22 Thread Richard Hillegas


I see that lib_managed/jars holds these old Derby versions:

  lib_managed/jars/derby-10.10.1.1.jar
  lib_managed/jars/derby-10.10.2.0.jar

The Derby 10.10 release family supports some ancient JVMs: Java SE 5 and
Java ME CDC/Foundation Profile 1.1. It's hard to imagine anyone running
Spark on the resource-constrained Java ME platform. Is Spark really
deployed on Java SE 5? Is there some other reason that Spark uses the 10.10
Derby family?

If no-one needs those ancient JVMs, maybe we could consider changing the
Derby version to 10.11.1.1 or even to the upcoming 10.12.1.1 release (both
run on Java 6 and up).

Thanks,
-Rick

Re: column identifiers in Spark SQL

2015-09-22 Thread Richard Hillegas

Thanks for that tip, Michael. I think that my sqlContext was a raw
SQLContext originally. I have rebuilt Spark like so...

  sbt/sbt -Phive assembly/assembly

Now I see that my sqlContext is a HiveContext. That fixes one of the
queries. Now unnormalized column names work:

  // ...unnormalized column names work now
  sqlContext.sql("""select a from test_data""").show

However, quoted identifiers are still treated as string literals:

  // this still returns rows consisting of the string literal "b"
  sqlContext.sql("""select "b" from test_data""").show

And embedded quotes inside quoted identifiers are swallowed up:

  // this now returns rows consisting of the string literal "cd"
  sqlContext.sql("""select "c""d" from test_data""").show

Thanks,
-Rick

Michael Armbrust  wrote on 09/22/2015 10:58:36 AM:

> From: Michael Armbrust 
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: Dev 
> Date: 09/22/2015 10:59 AM
> Subject: Re: column identifiers in Spark SQL
>
> Are you using a SQLContext or a HiveContext?  The programming guide
> suggests the latter, as the former is really only there because some
> applications may have conflicts with Hive dependencies.  SQLContext
> is case sensitive by default whereas the HiveContext is not.  The
> parser in HiveContext is also a lot better.
>
> On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas 
wrote:
> I am puzzled by the behavior of column identifiers in Spark SQL. I
> don't find any guidance in the "Spark SQL and DataFrame Guide" at
> http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> seeing odd behavior related to case-sensitivity and to delimited
> (quoted) identifiers.
>
> Consider the following declaration of a table in the Derby
> relational database, whose dialect hews closely to the SQL Standard:
>
>    create table app.t( a int, "b" int, "c""d" int );
>
> Now let's load that table into Spark like this:
>
>   import org.apache.spark.sql._
>   import org.apache.spark.sql.types._
>
>   val df = sqlContext.read.format("jdbc").options(
>     Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
>     "dbtable" -> "app.t")).load()
>   df.registerTempTable("test_data")
>
> The following query runs fine because the column name matches the
> normalized form in which it is stored in the metadata catalogs of
> the relational database:
>
>   // normalized column names are recognized
>   sqlContext.sql(s"""select A from test_data""").show
>
> But the following query fails during name resolution. This puzzles
> me because non-delimited identifiers are case-insensitive in the
> ANSI/ISO Standard. They are also supposed to be case-insensitive in
> HiveQL, at least according to section 2.3.1 of the
> QuotedIdentifier.html webpage attached to https://issues.apache.org/
> jira/browse/HIVE-6013:
>
>   // ...unnormalized column names raise this error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
> input columns A, b, c"d;
>   sqlContext.sql("""select a from test_data""").show
>
> Delimited (quoted) identifiers are treated as string literals.
> Again, non-Standard behavior:
>
>   // this returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> Embedded quotes in delimited identifiers won't even parse:
>
>   // embedded quotes raise this error: java.lang.RuntimeException:
> [1.11] failure: ``union'' expected but "d" found
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> This behavior is non-Standard and it strikes me as hard to describe
> to users concisely. Would the community support an effort to bring
> the handling of column identifiers into closer conformance with the
> Standard? Would backward compatibility concerns even allow us to do that?
>
> Thanks,
> -Rick

Fwd: Parallel collection in driver programs

2015-09-22 Thread Andy Huang
Hi Devs,

Hopefully one of you knows more about this?

Thanks

Andy
-- Forwarded message --
From: Andy Huang 
Date: Wed, Sep 23, 2015 at 12:39 PM
Subject: Parallel collection in driver programs
To: u...@spark.apache.org


Hi All,

I would like to know if anyone has experience with parallel collections in the
driver program, and whether there is an actual advantage/disadvantage to doing so.

E.g. With a collection of Jdbc connections and tables

We have adapted our non-Spark code, which utilizes parallel collections, to
Spark code and it seems to work fine.

val conf = List(
  ("tbl1","dbo.tbl1::tb1_id::0::127::128"),
  ("tbl2","dbo.tbl2::tb2_id::0::31::32"),
  ("tbl3","dbo.tbl3::tb3_id::0::63::64")
)

val _JDBC_DEFAULT = "jdbc:sqlserver://192.168.52.1;database=TestSource"
val _STORE_DEFAULT = "hdfs://192.168.52.132:9000/"

import java.util.Properties

val prop = new Properties()
prop.setProperty("user","sa")
prop.setProperty("password","password")

conf.par.map(pair=>{

  // pair._2 format: table or query::partitionColumn::lowerBound::upperBound::numPartitions
  val qry = pair._2.split("::")(0)
  val pCol = pair._2.split("::")(1)
  val lo = pair._2.split("::")(2).toInt
  val hi = pair._2.split("::")(3).toInt
  val part = pair._2.split("::")(4).toInt

  //create dataframe from jdbc table
  val jdbcDF = sqlContext.read.jdbc(
_JDBC_DEFAULT,
"("+qry+") a",
pCol,
lo, //lower bound
hi, //upper bound
part, //number of partitions
    prop //java.util.Properties - key value pair
  )

  //save to parquet
  jdbcDF.write.mode("overwrite").parquet(_STORE_DEFAULT+pair._1+".parquet")

})
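
One knob worth noting if you go this route (a sketch, not from the original mail): a .par collection runs on the shared default fork-join pool in Scala 2.10/2.11, so the number of tables extracted concurrently can be capped explicitly:

  import scala.collection.parallel.ForkJoinTaskSupport
  import scala.concurrent.forkjoin.ForkJoinPool

  val parConf = conf.par
  // limit concurrent JDBC extracts to 2 to avoid overloading the source database
  parConf.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
  parConf.map(pair => { /* same body as above */ })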


Thanks.
-- 
Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
f: 02 9376 0730| m: 0433221979



-- 
Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
f: 02 9376 0730| m: 0433221979


Re: Derby version in Spark

2015-09-22 Thread Ted Yu
I see.
I use maven to build so I observe different contents under lib_managed
directory.

Here is snippet of dependency tree:

[INFO] |  +- org.spark-project.hive:hive-metastore:jar:1.2.1.spark:compile
[INFO] |  |  +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
[INFO] |  |  +- org.apache.derby:derby:jar:10.10.1.1:compile

On Tue, Sep 22, 2015 at 3:21 PM, Richard Hillegas 
wrote:

> Thanks, Ted. I'm working on my master branch. The lib_managed/jars
> directory has a lot of jarballs, including hadoop and hive. Maybe these
> were faulted in when I built with the following command?
>
>   sbt/sbt -Phive assembly/assembly
>
> The Derby jars seem to be used in order to manage the metastore_db
> database. Maybe my question should be directed to the Hive community?
>
> Thanks,
> -Rick
>
> Here are the gory details:
>
> bash-3.2$ ls lib_managed/jars
> FastInfoset-1.2.12.jar curator-test-2.4.0.jar
> jersey-test-framework-grizzly2-1.9.jar parquet-format-2.3.0-incubating.jar
> JavaEWAH-0.3.2.jar datanucleus-api-jdo-3.2.6.jar jets3t-0.7.1.jar
> parquet-generator-1.7.0.jar
> ST4-4.0.4.jar datanucleus-core-3.2.10.jar
> jetty-continuation-8.1.14.v20131031.jar parquet-hadoop-1.7.0.jar
> activation-1.1.jar datanucleus-rdbms-3.2.9.jar
> jetty-http-8.1.14.v20131031.jar parquet-hadoop-bundle-1.6.0.jar
> akka-actor_2.10-2.3.11.jar derby-10.10.1.1.jar
> jetty-io-8.1.14.v20131031.jar parquet-jackson-1.7.0.jar
> akka-remote_2.10-2.3.11.jar derby-10.10.2.0.jar
> jetty-jndi-8.1.14.v20131031.jar platform-3.4.0.jar
> akka-slf4j_2.10-2.3.11.jar genjavadoc-plugin_2.10.4-0.9-spark0.jar
> jetty-plus-8.1.14.v20131031.jar pmml-agent-1.1.15.jar
> akka-testkit_2.10-2.3.11.jar groovy-all-2.1.6.jar
> jetty-security-8.1.14.v20131031.jar pmml-model-1.1.15.jar
> antlr-2.7.7.jar guava-11.0.2.jar jetty-server-8.1.14.v20131031.jar
> pmml-schema-1.1.15.jar
> antlr-runtime-3.4.jar guice-3.0.jar jetty-servlet-8.1.14.v20131031.jar
> postgresql-9.3-1102-jdbc41.jar
> aopalliance-1.0.jar h2-1.4.183.jar jetty-util-6.1.26.jar py4j-0.8.2.1.jar
> arpack_combined_all-0.1-javadoc.jar hadoop-annotations-2.2.0.jar
> jetty-util-8.1.14.v20131031.jar pyrolite-4.4.jar
> arpack_combined_all-0.1.jar hadoop-auth-2.2.0.jar
> jetty-webapp-8.1.14.v20131031.jar quasiquotes_2.10-2.0.0.jar
> asm-3.2.jar hadoop-client-2.2.0.jar jetty-websocket-8.1.14.v20131031.jar
> reflectasm-1.07-shaded.jar
> avro-1.7.4.jar hadoop-common-2.2.0.jar jetty-xml-8.1.14.v20131031.jar
> sac-1.3.jar
> avro-1.7.7.jar hadoop-hdfs-2.2.0.jar jline-0.9.94.jar
> scala-compiler-2.10.0.jar
> avro-ipc-1.7.7-tests.jar hadoop-mapreduce-client-app-2.2.0.jar
> jline-2.10.4.jar scala-compiler-2.10.4.jar
> avro-ipc-1.7.7.jar hadoop-mapreduce-client-common-2.2.0.jar jline-2.12.jar
> scala-library-2.10.4.jar
> avro-mapred-1.7.7-hadoop2.jar hadoop-mapreduce-client-core-2.2.0.jar
> jna-3.4.0.jar scala-reflect-2.10.4.jar
> breeze-macros_2.10-0.11.2.jar hadoop-mapreduce-client-jobclient-2.2.0.jar
> joda-time-2.5.jar scalacheck_2.10-1.11.3.jar
> breeze_2.10-0.11.2.jar hadoop-mapreduce-client-shuffle-2.2.0.jar
> jodd-core-3.5.2.jar scalap-2.10.0.jar
> calcite-avatica-1.2.0-incubating.jar hadoop-yarn-api-2.2.0.jar
> json-20080701.jar selenium-api-2.42.2.jar
> calcite-core-1.2.0-incubating.jar hadoop-yarn-client-2.2.0.jar
> json-20090211.jar selenium-chrome-driver-2.42.2.jar
> calcite-linq4j-1.2.0-incubating.jar hadoop-yarn-common-2.2.0.jar
> json4s-ast_2.10-3.2.10.jar selenium-firefox-driver-2.42.2.jar
> cglib-2.2.1-v20090111.jar hadoop-yarn-server-common-2.2.0.jar
> json4s-core_2.10-3.2.10.jar selenium-htmlunit-driver-2.42.2.jar
> cglib-nodep-2.1_3.jar hadoop-yarn-server-nodemanager-2.2.0.jar
> json4s-jackson_2.10-3.2.10.jar selenium-ie-driver-2.42.2.jar
> chill-java-0.5.0.jar hamcrest-core-1.1.jar jsr173_api-1.0.jar
> selenium-java-2.42.2.jar
> chill_2.10-0.5.0.jar hamcrest-core-1.3.jar jsr305-1.3.9.jar
> selenium-remote-driver-2.42.2.jar
> commons-beanutils-1.7.0.jar hamcrest-library-1.3.jar jsr305-2.0.1.jar
> selenium-safari-driver-2.42.2.jar
> commons-beanutils-core-1.8.0.jar hive-exec-1.2.1.spark.jar jta-1.1.jar
> selenium-support-2.42.2.jar
> commons-cli-1.2.jar hive-metastore-1.2.1.spark.jar jtransforms-2.4.0.jar
> serializer-2.7.1.jar
> commons-codec-1.10.jar htmlunit-2.14.jar jul-to-slf4j-1.7.10.jar
> slf4j-api-1.7.10.jar
> commons-codec-1.4.jar htmlunit-core-js-2.14.jar junit-4.10.jar
> slf4j-log4j12-1.7.10.jar
> commons-codec-1.5.jar httpclient-4.3.2.jar junit-dep-4.10.jar
> snappy-0.2.jar
> commons-codec-1.9.jar httpcore-4.3.1.jar junit-dep-4.8.2.jar
> spire-macros_2.10-0.7.4.jar
> commons-collections-3.2.1.jar httpmime-4.3.2.jar junit-interface-0.10.jar
> spire_2.10-0.7.4.jar
> commons-compiler-2.7.8.jar istack-commons-runtime-2.16.jar
> junit-interface-0.9.jar stax-api-1.0.1.jar
> commons-compress-1.4.1.jar ivy-2.4.0.jar libfb303-0.9.2.jar
> stream-2.7.0.jar
> commons-configuration-1.6.jar jackson-core-asl-1.8.8.jar
> libthrift-0.9.2.jar stringtemplate-3.2.1.jar
> commons-dbcp-1.4.jar 

Re: Derby version in Spark

2015-09-22 Thread Richard Hillegas

Thanks, Ted. I'll follow up with the Hive folks.

Cheers,
-Rick

Ted Yu  wrote on 09/22/2015 03:41:12 PM:

> From: Ted Yu 
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: Dev 
> Date: 09/22/2015 03:41 PM
> Subject: Re: Derby version in Spark
>
> I cloned Hive 1.2 code base and saw:
>
>     10.10.2.0
>
> So the version used by Spark is quite close to what Hive uses.
>
> On Tue, Sep 22, 2015 at 3:29 PM, Ted Yu  wrote:
> I see.
> I use maven to build so I observe different contents under
> lib_managed directory.
>
> Here is snippet of dependency tree:
>
> [INFO] |  +-
org.spark-project.hive:hive-metastore:jar:1.2.1.spark:compile
> [INFO] |  |  +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
> [INFO] |  |  +- org.apache.derby:derby:jar:10.10.1.1:compile
>
> On Tue, Sep 22, 2015 at 3:21 PM, Richard Hillegas 
wrote:
> Thanks, Ted. I'm working on my master branch. The lib_managed/jars
> directory has a lot of jarballs, including hadoop and hive. Maybe
> these were faulted in when I built with the following command?
>
>   sbt/sbt -Phive assembly/assembly
>
> The Derby jars seem to be used in order to manage the metastore_db
> database. Maybe my question should be directed to the Hive community?
>
> Thanks,
> -Rick
>
> Here are the gory details:
>
> bash-3.2$ ls lib_managed/jars
> FastInfoset-1.2.12.jar curator-test-2.4.0.jar jersey-test-framework-
> grizzly2-1.9.jar parquet-format-2.3.0-incubating.jar
> JavaEWAH-0.3.2.jar datanucleus-api-jdo-3.2.6.jar jets3t-0.7.1.jar
> parquet-generator-1.7.0.jar
> ST4-4.0.4.jar datanucleus-core-3.2.10.jar jetty-continuation-8.1.
> 14.v20131031.jar parquet-hadoop-1.7.0.jar
> activation-1.1.jar datanucleus-rdbms-3.2.9.jar jetty-http-8.1.
> 14.v20131031.jar parquet-hadoop-bundle-1.6.0.jar
> akka-actor_2.10-2.3.11.jar derby-10.10.1.1.jar jetty-io-8.1.
> 14.v20131031.jar parquet-jackson-1.7.0.jar
> akka-remote_2.10-2.3.11.jar derby-10.10.2.0.jar jetty-jndi-8.1.
> 14.v20131031.jar platform-3.4.0.jar
> akka-slf4j_2.10-2.3.11.jar genjavadoc-plugin_2.10.4-0.9-spark0.jar
> jetty-plus-8.1.14.v20131031.jar pmml-agent-1.1.15.jar
> akka-testkit_2.10-2.3.11.jar groovy-all-2.1.6.jar jetty-security-8.
> 1.14.v20131031.jar pmml-model-1.1.15.jar
> antlr-2.7.7.jar guava-11.0.2.jar jetty-server-8.1.14.v20131031.jar
> pmml-schema-1.1.15.jar
> antlr-runtime-3.4.jar guice-3.0.jar jetty-servlet-8.1.
> 14.v20131031.jar postgresql-9.3-1102-jdbc41.jar
> aopalliance-1.0.jar h2-1.4.183.jar jetty-util-6.1.26.jar py4j-0.8.2.1.jar
> arpack_combined_all-0.1-javadoc.jar hadoop-annotations-2.2.0.jar
> jetty-util-8.1.14.v20131031.jar pyrolite-4.4.jar
> arpack_combined_all-0.1.jar hadoop-auth-2.2.0.jar jetty-webapp-8.1.
> 14.v20131031.jar quasiquotes_2.10-2.0.0.jar
> asm-3.2.jar hadoop-client-2.2.0.jar jetty-websocket-8.1.
> 14.v20131031.jar reflectasm-1.07-shaded.jar
> avro-1.7.4.jar hadoop-common-2.2.0.jar jetty-xml-8.1.
> 14.v20131031.jar sac-1.3.jar
> avro-1.7.7.jar hadoop-hdfs-2.2.0.jar jline-0.9.94.jar scala-
> compiler-2.10.0.jar
> avro-ipc-1.7.7-tests.jar hadoop-mapreduce-client-app-2.2.0.jar
> jline-2.10.4.jar scala-compiler-2.10.4.jar
> avro-ipc-1.7.7.jar hadoop-mapreduce-client-common-2.2.0.jar jline-2.
> 12.jar scala-library-2.10.4.jar
> avro-mapred-1.7.7-hadoop2.jar hadoop-mapreduce-client-core-2.2.0.jar
> jna-3.4.0.jar scala-reflect-2.10.4.jar
> breeze-macros_2.10-0.11.2.jar hadoop-mapreduce-client-jobclient-2.2.
> 0.jar joda-time-2.5.jar scalacheck_2.10-1.11.3.jar
> breeze_2.10-0.11.2.jar hadoop-mapreduce-client-shuffle-2.2.0.jar
> jodd-core-3.5.2.jar scalap-2.10.0.jar
> calcite-avatica-1.2.0-incubating.jar hadoop-yarn-api-2.2.0.jar
> json-20080701.jar selenium-api-2.42.2.jar
> calcite-core-1.2.0-incubating.jar hadoop-yarn-client-2.2.0.jar
> json-20090211.jar selenium-chrome-driver-2.42.2.jar
> calcite-linq4j-1.2.0-incubating.jar hadoop-yarn-common-2.2.0.jar
> json4s-ast_2.10-3.2.10.jar selenium-firefox-driver-2.42.2.jar
> cglib-2.2.1-v20090111.jar hadoop-yarn-server-common-2.2.0.jar
> json4s-core_2.10-3.2.10.jar selenium-htmlunit-driver-2.42.2.jar
> cglib-nodep-2.1_3.jar hadoop-yarn-server-nodemanager-2.2.0.jar
> json4s-jackson_2.10-3.2.10.jar selenium-ie-driver-2.42.2.jar
> chill-java-0.5.0.jar hamcrest-core-1.1.jar jsr173_api-1.0.jar
> selenium-java-2.42.2.jar
> chill_2.10-0.5.0.jar hamcrest-core-1.3.jar jsr305-1.3.9.jar
> selenium-remote-driver-2.42.2.jar
> commons-beanutils-1.7.0.jar hamcrest-library-1.3.jar jsr305-2.0.
> 1.jar selenium-safari-driver-2.42.2.jar
> commons-beanutils-core-1.8.0.jar hive-exec-1.2.1.spark.jar jta-1.
> 1.jar selenium-support-2.42.2.jar
> commons-cli-1.2.jar hive-metastore-1.2.1.spark.jar jtransforms-2.4.
> 0.jar serializer-2.7.1.jar
> commons-codec-1.10.jar htmlunit-2.14.jar jul-to-slf4j-1.7.10.jar
> slf4j-api-1.7.10.jar
> commons-codec-1.4.jar htmlunit-core-js-2.14.jar junit-4.10.jar
> slf4j-log4j12-1.7.10.jar
> commons-codec-1.5.jar httpclient-4.3.2.jar 

Re: column identifiers in Spark SQL

2015-09-22 Thread Richard Hillegas

Thanks for that additional tip, Michael. Backticks fix the problem query in
which an identifier was transformed into a string literal. So this works
now...

  // now correctly resolves the unnormalized column id
  sqlContext.sql("""select `b` from test_data""").show

Any suggestion about how to escape an embedded double quote?

  // java.sql.SQLSyntaxErrorException: Syntax error: Encountered "\"" at
line 1, column 12.
  sqlContext.sql("""select `c"d` from test_data""").show

  // org.apache.spark.sql.AnalysisException: cannot resolve 'c\"d' given
input columns A, b, c"d; line 1 pos 7
  sqlContext.sql("""select `c\"d` from test_data""").show

Thanks,
-Rick

Michael Armbrust  wrote on 09/22/2015 01:16:12 PM:

> From: Michael Armbrust 
> To: Richard Hillegas/San Francisco/IBM@IBMUS
> Cc: Dev 
> Date: 09/22/2015 01:16 PM
> Subject: Re: column identifiers in Spark SQL
>
> HiveQL uses `backticks` for quoted identifiers.
>
> On Tue, Sep 22, 2015 at 1:06 PM, Richard Hillegas 
wrote:
> Thanks for that tip, Michael. I think that my sqlContext was a raw
> SQLContext originally. I have rebuilt Spark like so...
>
>   sbt/sbt -Phive assembly/assembly
>
> Now I see that my sqlContext is a HiveContext. That fixes one of the
> queries. Now unnormalized column names work:
>
>   // ...unnormalized column names work now
>   sqlContext.sql("""select a from test_data""").show
>
> However, quoted identifiers are still treated as string literals:
>
>   // this still returns rows consisting of the string literal "b"
>   sqlContext.sql("""select "b" from test_data""").show
>
> And embedded quotes inside quoted identifiers are swallowed up:
>
>   // this now returns rows consisting of the string literal "cd"
>   sqlContext.sql("""select "c""d" from test_data""").show
>
> Thanks,
> -Rick
>
> Michael Armbrust  wrote on 09/22/2015 10:58:36
AM:
>
> > From: Michael Armbrust 
> > To: Richard Hillegas/San Francisco/IBM@IBMUS
> > Cc: Dev 
> > Date: 09/22/2015 10:59 AM
> > Subject: Re: column identifiers in Spark SQL
>
> >
> > Are you using a SQLContext or a HiveContext?  The programming guide
> > suggests the latter, as the former is really only there because some
> > applications may have conflicts with Hive dependencies.  SQLContext
> > is case sensitive by default whereas the HiveContext is not.  The
> > parser in HiveContext is also a lot better.
> >
> > On Tue, Sep 22, 2015 at 10:53 AM, Richard Hillegas  > wrote:
> > I am puzzled by the behavior of column identifiers in Spark SQL. I
> > don't find any guidance in the "Spark SQL and DataFrame Guide" at
> > http://spark.apache.org/docs/latest/sql-programming-guide.html. I am
> > seeing odd behavior related to case-sensitivity and to delimited
> > (quoted) identifiers.
> >
> > Consider the following declaration of a table in the Derby
> > relational database, whose dialect hews closely to the SQL Standard:
> >
> >    create table app.t( a int, "b" int, "c""d" int );
> >
> > Now let's load that table into Spark like this:
> >
> >   import org.apache.spark.sql._
> >   import org.apache.spark.sql.types._
> >
> >   val df = sqlContext.read.format("jdbc").options(
> >     Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
> >     "dbtable" -> "app.t")).load()
> >   df.registerTempTable("test_data")
> >
> > The following query runs fine because the column name matches the
> > normalized form in which it is stored in the metadata catalogs of
> > the relational database:
> >
> >   // normalized column names are recognized
> >   sqlContext.sql(s"""select A from test_data""").show
> >
> > But the following query fails during name resolution. This puzzles
> > me because non-delimited identifiers are case-insensitive in the
> > ANSI/ISO Standard. They are also supposed to be case-insensitive in
> > HiveQL, at least according to section 2.3.1 of the
> > QuotedIdentifier.html webpage attached to https://issues.apache.org/
> > jira/browse/HIVE-6013:
> >
> >   // ...unnormalized column names raise this error:
> > org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
> > input columns A, b, c"d;
> >   sqlContext.sql("""select a from test_data""").show
> >
> > Delimited (quoted) identifiers are treated as string literals.
> > Again, non-Standard behavior:
> >
> >   // this returns rows consisting of the string literal "b"
> >   sqlContext.sql("""select "b" from test_data""").show
> >
> > Embedded quotes in delimited identifiers won't even parse:
> >
> >   // embedded quotes raise this error: java.lang.RuntimeException:
> > [1.11] failure: ``union'' expected but "d" found
> >   sqlContext.sql("""select "c""d" from test_data""").show
> >
> > This behavior is non-Standard and it strikes me as hard to describe
> > to users concisely. Would the community support an effort to bring
> > the handling of 

Re: Derby version in Spark

2015-09-22 Thread Richard Hillegas

Thanks, Ted. I'm working on my master branch. The lib_managed/jars
directory has a lot of jarballs, including hadoop and hive. Maybe these
were faulted in when I built with the following command?

  sbt/sbt -Phive assembly/assembly

The Derby jars seem to be used in order to manage the metastore_db
database. Maybe my question should be directed to the Hive community?

Thanks,
-Rick

Here are the gory details:

bash-3.2$ ls lib_managed/jars
FastInfoset-1.2.12.jar  curator-test-2.4.0.jar
jersey-test-framework-grizzly2-1.9.jar
parquet-format-2.3.0-incubating.jar
JavaEWAH-0.3.2.jar  datanucleus-api-jdo-3.2.6.jar
jets3t-0.7.1.jar
parquet-generator-1.7.0.jar
ST4-4.0.4.jar   datanucleus-core-3.2.10.jar
jetty-continuation-8.1.14.v20131031.jar
parquet-hadoop-1.7.0.jar
activation-1.1.jar  datanucleus-rdbms-3.2.9.jar
jetty-http-8.1.14.v20131031.jar
parquet-hadoop-bundle-1.6.0.jar
akka-actor_2.10-2.3.11.jar  derby-10.10.1.1.jar
jetty-io-8.1.14.v20131031.jar   
parquet-jackson-1.7.0.jar
akka-remote_2.10-2.3.11.jar derby-10.10.2.0.jar
jetty-jndi-8.1.14.v20131031.jar platform-3.4.0.jar
akka-slf4j_2.10-2.3.11.jar
genjavadoc-plugin_2.10.4-0.9-spark0.jar
jetty-plus-8.1.14.v20131031.jar pmml-agent-1.1.15.jar
akka-testkit_2.10-2.3.11.jargroovy-all-2.1.6.jar
jetty-security-8.1.14.v20131031.jar pmml-model-1.1.15.jar
antlr-2.7.7.jar guava-11.0.2.jar
jetty-server-8.1.14.v20131031.jar   pmml-schema-1.1.15.jar
antlr-runtime-3.4.jar   guice-3.0.jar
jetty-servlet-8.1.14.v20131031.jar
postgresql-9.3-1102-jdbc41.jar
aopalliance-1.0.jar h2-1.4.183.jar
jetty-util-6.1.26.jar   py4j-0.8.2.1.jar
arpack_combined_all-0.1-javadoc.jar hadoop-annotations-2.2.0.jar
jetty-util-8.1.14.v20131031.jar pyrolite-4.4.jar
arpack_combined_all-0.1.jar hadoop-auth-2.2.0.jar
jetty-webapp-8.1.14.v20131031.jar   
quasiquotes_2.10-2.0.0.jar
asm-3.2.jar hadoop-client-2.2.0.jar
jetty-websocket-8.1.14.v20131031.jarreflectasm-1.07-shaded.jar
avro-1.7.4.jar  hadoop-common-2.2.0.jar
jetty-xml-8.1.14.v20131031.jar  sac-1.3.jar
avro-1.7.7.jar  hadoop-hdfs-2.2.0.jar
jline-0.9.94.jar
scala-compiler-2.10.0.jar
avro-ipc-1.7.7-tests.jar
hadoop-mapreduce-client-app-2.2.0.jar   jline-2.10.4.jar
scala-compiler-2.10.4.jar
avro-ipc-1.7.7.jar
hadoop-mapreduce-client-common-2.2.0.jarjline-2.12.jar
scala-library-2.10.4.jar
avro-mapred-1.7.7-hadoop2.jar
hadoop-mapreduce-client-core-2.2.0.jar  jna-3.4.0.jar
scala-reflect-2.10.4.jar
breeze-macros_2.10-0.11.2.jar
hadoop-mapreduce-client-jobclient-2.2.0.jar joda-time-2.5.jar
scalacheck_2.10-1.11.3.jar
breeze_2.10-0.11.2.jar
hadoop-mapreduce-client-shuffle-2.2.0.jar   jodd-core-3.5.2.jar
scalap-2.10.0.jar
calcite-avatica-1.2.0-incubating.jarhadoop-yarn-api-2.2.0.jar
json-20080701.jar   
selenium-api-2.42.2.jar
calcite-core-1.2.0-incubating.jar   hadoop-yarn-client-2.2.0.jar
json-20090211.jar   
selenium-chrome-driver-2.42.2.jar
calcite-linq4j-1.2.0-incubating.jar hadoop-yarn-common-2.2.0.jar
json4s-ast_2.10-3.2.10.jar
selenium-firefox-driver-2.42.2.jar
cglib-2.2.1-v20090111.jar
hadoop-yarn-server-common-2.2.0.jar json4s-core_2.10-3.2.10.jar
selenium-htmlunit-driver-2.42.2.jar
cglib-nodep-2.1_3.jar
hadoop-yarn-server-nodemanager-2.2.0.jarjson4s-jackson_2.10-3.2.10.jar
selenium-ie-driver-2.42.2.jar
chill-java-0.5.0.jarhamcrest-core-1.1.jar
jsr173_api-1.0.jar  selenium-java-2.42.2.jar
chill_2.10-0.5.0.jarhamcrest-core-1.3.jar
jsr305-1.3.9.jar
selenium-remote-driver-2.42.2.jar
commons-beanutils-1.7.0.jar hamcrest-library-1.3.jar
jsr305-2.0.1.jar
selenium-safari-driver-2.42.2.jar
commons-beanutils-core-1.8.0.jarhive-exec-1.2.1.spark.jar
jta-1.1.jar 
selenium-support-2.42.2.jar
commons-cli-1.2.jar hive-metastore-1.2.1.spark.jar
jtransforms-2.4.0.jar   
serializer-2.7.1.jar

Re: Derby version in Spark

2015-09-22 Thread Ted Yu
I cloned Hive 1.2 code base and saw:

10.10.2.0

So the version used by Spark is quite close to what Hive uses.

On Tue, Sep 22, 2015 at 3:29 PM, Ted Yu  wrote:

> I see.
> I use maven to build so I observe different contents under lib_managed
> directory.
>
> Here is snippet of dependency tree:
>
> [INFO] |  +- org.spark-project.hive:hive-metastore:jar:1.2.1.spark:compile
> [INFO] |  |  +- com.jolbox:bonecp:jar:0.8.0.RELEASE:compile
> [INFO] |  |  +- org.apache.derby:derby:jar:10.10.1.1:compile
>
> On Tue, Sep 22, 2015 at 3:21 PM, Richard Hillegas 
> wrote:
>
>> Thanks, Ted. I'm working on my master branch. The lib_managed/jars
>> directory has a lot of jarballs, including hadoop and hive. Maybe these
>> were faulted in when I built with the following command?
>>
>>   sbt/sbt -Phive assembly/assembly
>>
>> The Derby jars seem to be used in order to manage the metastore_db
>> database. Maybe my question should be directed to the Hive community?
>>
>> Thanks,
>> -Rick
>>
>> Here are the gory details:
>>
>> bash-3.2$ ls lib_managed/jars
>> FastInfoset-1.2.12.jar curator-test-2.4.0.jar
>> jersey-test-framework-grizzly2-1.9.jar parquet-format-2.3.0-incubating.jar
>> JavaEWAH-0.3.2.jar datanucleus-api-jdo-3.2.6.jar jets3t-0.7.1.jar
>> parquet-generator-1.7.0.jar
>> ST4-4.0.4.jar datanucleus-core-3.2.10.jar
>> jetty-continuation-8.1.14.v20131031.jar parquet-hadoop-1.7.0.jar
>> activation-1.1.jar datanucleus-rdbms-3.2.9.jar
>> jetty-http-8.1.14.v20131031.jar parquet-hadoop-bundle-1.6.0.jar
>> akka-actor_2.10-2.3.11.jar derby-10.10.1.1.jar
>> jetty-io-8.1.14.v20131031.jar parquet-jackson-1.7.0.jar
>> akka-remote_2.10-2.3.11.jar derby-10.10.2.0.jar
>> jetty-jndi-8.1.14.v20131031.jar platform-3.4.0.jar
>> akka-slf4j_2.10-2.3.11.jar genjavadoc-plugin_2.10.4-0.9-spark0.jar
>> jetty-plus-8.1.14.v20131031.jar pmml-agent-1.1.15.jar
>> akka-testkit_2.10-2.3.11.jar groovy-all-2.1.6.jar
>> jetty-security-8.1.14.v20131031.jar pmml-model-1.1.15.jar
>> antlr-2.7.7.jar guava-11.0.2.jar jetty-server-8.1.14.v20131031.jar
>> pmml-schema-1.1.15.jar
>> antlr-runtime-3.4.jar guice-3.0.jar jetty-servlet-8.1.14.v20131031.jar
>> postgresql-9.3-1102-jdbc41.jar
>> aopalliance-1.0.jar h2-1.4.183.jar jetty-util-6.1.26.jar py4j-0.8.2.1.jar
>> arpack_combined_all-0.1-javadoc.jar hadoop-annotations-2.2.0.jar
>> jetty-util-8.1.14.v20131031.jar pyrolite-4.4.jar
>> arpack_combined_all-0.1.jar hadoop-auth-2.2.0.jar
>> jetty-webapp-8.1.14.v20131031.jar quasiquotes_2.10-2.0.0.jar
>> asm-3.2.jar hadoop-client-2.2.0.jar jetty-websocket-8.1.14.v20131031.jar
>> reflectasm-1.07-shaded.jar
>> avro-1.7.4.jar hadoop-common-2.2.0.jar jetty-xml-8.1.14.v20131031.jar
>> sac-1.3.jar
>> avro-1.7.7.jar hadoop-hdfs-2.2.0.jar jline-0.9.94.jar
>> scala-compiler-2.10.0.jar
>> avro-ipc-1.7.7-tests.jar hadoop-mapreduce-client-app-2.2.0.jar
>> jline-2.10.4.jar scala-compiler-2.10.4.jar
>> avro-ipc-1.7.7.jar hadoop-mapreduce-client-common-2.2.0.jar
>> jline-2.12.jar scala-library-2.10.4.jar
>> avro-mapred-1.7.7-hadoop2.jar hadoop-mapreduce-client-core-2.2.0.jar
>> jna-3.4.0.jar scala-reflect-2.10.4.jar
>> breeze-macros_2.10-0.11.2.jar hadoop-mapreduce-client-jobclient-2.2.0.jar
>> joda-time-2.5.jar scalacheck_2.10-1.11.3.jar
>> breeze_2.10-0.11.2.jar hadoop-mapreduce-client-shuffle-2.2.0.jar
>> jodd-core-3.5.2.jar scalap-2.10.0.jar
>> calcite-avatica-1.2.0-incubating.jar hadoop-yarn-api-2.2.0.jar
>> json-20080701.jar selenium-api-2.42.2.jar
>> calcite-core-1.2.0-incubating.jar hadoop-yarn-client-2.2.0.jar
>> json-20090211.jar selenium-chrome-driver-2.42.2.jar
>> calcite-linq4j-1.2.0-incubating.jar hadoop-yarn-common-2.2.0.jar
>> json4s-ast_2.10-3.2.10.jar selenium-firefox-driver-2.42.2.jar
>> cglib-2.2.1-v20090111.jar hadoop-yarn-server-common-2.2.0.jar
>> json4s-core_2.10-3.2.10.jar selenium-htmlunit-driver-2.42.2.jar
>> cglib-nodep-2.1_3.jar hadoop-yarn-server-nodemanager-2.2.0.jar
>> json4s-jackson_2.10-3.2.10.jar selenium-ie-driver-2.42.2.jar
>> chill-java-0.5.0.jar hamcrest-core-1.1.jar jsr173_api-1.0.jar
>> selenium-java-2.42.2.jar
>> chill_2.10-0.5.0.jar hamcrest-core-1.3.jar jsr305-1.3.9.jar
>> selenium-remote-driver-2.42.2.jar
>> commons-beanutils-1.7.0.jar hamcrest-library-1.3.jar jsr305-2.0.1.jar
>> selenium-safari-driver-2.42.2.jar
>> commons-beanutils-core-1.8.0.jar hive-exec-1.2.1.spark.jar jta-1.1.jar
>> selenium-support-2.42.2.jar
>> commons-cli-1.2.jar hive-metastore-1.2.1.spark.jar jtransforms-2.4.0.jar
>> serializer-2.7.1.jar
>> commons-codec-1.10.jar htmlunit-2.14.jar jul-to-slf4j-1.7.10.jar
>> slf4j-api-1.7.10.jar
>> commons-codec-1.4.jar htmlunit-core-js-2.14.jar junit-4.10.jar
>> slf4j-log4j12-1.7.10.jar
>> commons-codec-1.5.jar httpclient-4.3.2.jar junit-dep-4.10.jar
>> snappy-0.2.jar
>> commons-codec-1.9.jar httpcore-4.3.1.jar junit-dep-4.8.2.jar
>> spire-macros_2.10-0.7.4.jar
>> commons-collections-3.2.1.jar httpmime-4.3.2.jar junit-interface-0.10.jar
>> spire_2.10-0.7.4.jar
>> 

Why Filter return a DataFrame object in DataFrame.scala?

2015-09-22 Thread qiuhai
Hi,
  Recently, I have been reading the Spark SQL source code (1.5 version).
  In DataFrame.scala, there is a function named filter at line 737:

  def filter(condition: Column): DataFrame = Filter(condition.expr, logicalPlan)

  The function returns a Filter object, but its declared return type requires a DataFrame.

  Thanks.
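
(One likely explanation, not from the thread itself: DataFrame.scala defines a private implicit conversion from LogicalPlan to DataFrame, so the Filter logical-plan node is wrapped automatically; worth confirming in the source. A standalone sketch of that mechanism, with made-up stand-in types rather than Spark's:)

  import scala.language.implicitConversions

  class Plan                    // stands in for LogicalPlan
  class Frame(val plan: Plan)   // stands in for DataFrame

  // implicit conversion analogous to the private one assumed above
  implicit def planToFrame(p: Plan): Frame = new Frame(p)

  // body builds a Plan, declared result type is Frame -- the conversion bridges the gap
  def filterLike(): Frame = new Plan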







Re: Why there is no snapshots for 1.5 branch?

2015-09-22 Thread Bin Wang
Thanks. I've solved it. I modified pom.xml, added my own repo to it, and
then used "mvn deploy".

Fengdong Yu 于2015年9月22日周二 下午2:08写道:

> Basically, you can build a snapshot by yourself.
>
> Just clone the source code, and then 'mvn package/deploy/install…'
>
>
> Azuryy Yu
>
>
>
> On Sep 22, 2015, at 13:36, Bin Wang  wrote:
>
> However, I found some scripts in dev/audit-release; can I use them?
>
> Bin Wang wrote on Tue, Sep 22, 2015 at 1:34 PM:
>
>> No, I mean pushing Spark to my private repository. Spark doesn't have a
>> build.sbt as far as I can see.
>>
>> Fengdong Yu wrote on Tue, Sep 22, 2015 at 1:29 PM:
>>
>>> Do you mean you want to publish the artifact to your private repository?
>>>
>>> If so, please use 'sbt publish'.
>>>
>>> Add the following to your build.sbt:
>>>
>>> publishTo := {
>>>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>>>   if (version.value.endsWith("SNAPSHOT"))
>>>     Some("snapshots" at nexus + "content/repositories/snapshots")
>>>   else
>>>     Some("releases" at nexus + "content/repositories/releases")
>>> }
>>>
>>>
>>>
>>> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
>>>
>>> My project is using sbt (or maven), which need to download dependency
>>> from a maven repo. I have my own private maven repo with nexus but I don't
>>> know how to push my own build to it, can you give me a hint?
>>>
>>> Mark Hamstra wrote on Tue, Sep 22, 2015 at 1:25 PM:
>>>
 Yeah, whoever is maintaining the scripts and snapshot builds has fallen
 down on the job -- but there is nothing preventing you from checking out
 branch-1.5 and creating your own build, which is arguably a smarter thing
 to do anyway.  If I'm going to use a non-release build, then I want the
 full git commit history of exactly what is in that build readily available,
 not just somewhat arbitrary JARs.

 On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:

> But I cannot find 1.5.1-SNAPSHOT either at
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>
> Mark Hamstra wrote on Tue, Sep 22, 2015 at 12:55 PM:
>
>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
>> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1
>> release candidates and then the 1.5.1 release.
>>
>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>>
>>> I'd like to use some important bug fixes in the 1.5 branch, and I looked
>>> for the Apache Maven host, but don't find any snapshot for the 1.5 branch.
>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>
>>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for
>>> 1.5.X?
>>>
>>
>>

>>>
>


RowMatrix tallSkinnyQR - ERROR: Second call to constructor of static parser

2015-09-22 Thread Saif.A.Ellafi
Hi all,

I am wondering if anyone could make the new 1.5.0 tallSkinnyQR work.
My output follows; it is a big loop of the same errors until the shell dies.
I am curious, since I'm failing to load any implementations from BLAS, LAPACK,
etc.

scala> mat.tallSkinnyQR(false)
15/09/22 10:18:11 WARN LAPACK: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemLAPACK
15/09/22 10:18:11 WARN LAPACK: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefLAPACK
ERROR: Second call to constructor of static parser.  You must
ERROR: Second call to constructor of static parser.  You must
   either use ReInit() or set the JavaCC option STATIC to false
ERROR: Second call to constructor of static parser.  You must
   either use ReInit() or set the JavaCC option STATIC to false
   during parser generation.
ERROR: Second call to constructor of static parser.  You must
   either use ReInit() or set the JavaCC option STATIC to false
   during parser generation.
...
ERROR: Second call to constructor of static parser.  You must
   either use ReInit() or set the JavaCC option STATIC to false
   during parser generation.
15/09/22 10:18:11 ERROR Executor: Exception in task 6.0 in stage 3.0 (TID 31)
java.lang.Error
at org.j_paine.formatter.FormatParser.(FormatParser.java:353)
at org.j_paine.formatter.FormatParser.(FormatParser.java:346)
at org.j_paine.formatter.Parsers.(Formatter.java:1748)
at org.j_paine.formatter.Parsers.theParsers(Formatter.java:1739)
at org.j_paine.formatter.Format.(Formatter.java:177)
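
For context, a minimal self-contained sketch of the call being exercised (assumed toy data, not taken from the report; the LAPACK warnings above should only mean netlib-java fell back from the native to its pure-Java implementation):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // a small tall-and-skinny matrix: many rows, few columns
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0),
    Vectors.dense(3.0, 4.0),
    Vectors.dense(5.0, 6.0)))
  val mat = new RowMatrix(rows)

  // computeQ = false returns only the R factor of the QR decomposition
  val qr = mat.tallSkinnyQR(computeQ = false)
  println(qr.R)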



Open Issues for Contributors

2015-09-22 Thread Pedro Rodriguez
Where is the best place to look at open issues that haven't been
assigned/started for the next release? I am interested in working on
something, but I don't know what issues are higher priority for the next
release.

On a similar note, is there somewhere which outlines the overall goals for
the next release (be it 1.5.1 or 1.6) with some parent issues along with
smaller child issues to work on (like the built ins ticket from 1.5)?

Thanks,
-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience