How to convert non-RDD data to an RDD

2014-10-11 Thread rapelly kartheek
Hi,

I am trying to write a String that is not an RDD to HDFS. This data is a
variable in the Spark scheduler code. None of the Spark file operations are
working because my data is not an RDD.

So, I tried using SparkContext.parallelize(data), but it throws an error:

[error]
/home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265:
not found: value SparkContext
[error]  SparkContext.parallelize(result)
[error]  ^
[error] one error found

I realized that this data is part of the scheduler, so the SparkContext
would not have been created yet.

Any help in "writing scheduler variable data to HDFS" is appreciated!!

-Karthik
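
For what it's worth, a plain string can be written to HDFS directly through
Hadoop's FileSystem API, with no RDD or SparkContext involved at all. A
minimal sketch, assuming the Hadoop configuration is reachable from the
scheduler code and using a hypothetical target path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// `result` stands in for the scheduler variable mentioned in the error above.
val result = "scheduler state to persist"

// Hypothetical path; adjust the configuration and location to your deployment.
val conf = new Configuration()
val fs = FileSystem.get(conf)
val out = fs.create(new Path("hdfs:///tmp/scheduler-debug.txt"))
try {
  out.write(result.getBytes("UTF-8"))
} finally {
  out.close()
}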


Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Davies Liu
Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915

On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas  wrote:
> It's true that it is an implementation detail, but it's a very important one
> to document because it has the possibility of changing results depending on
> when I use take or collect.  The issue I was running into was when the
> executor had a different operating system than the driver, and I was using
> 'pipe' with a binary I compiled myself.  I needed to make sure I used the
> binary compiled for the operating system I expect it to run on.  So in cases
> where I was only interested in the first value, my code was breaking
> horribly on 1.0.2, but working fine on 1.1.
>
> My only suggestion would be to backport 'spark.localExecution.enabled' to
> the 1.0 line.  Thanks for all your help!
>
> Evan
>
> On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu  wrote:
>>
>> This is an implementation detail, so it's not documented :-(
>>
>> If you think this is a blocker for you, you could create a JIRA; maybe
>> it could be fixed in 1.0.3+.
>>
>> Davies
>>
>> On Fri, Oct 10, 2014 at 5:11 PM, Evan  wrote:
>> > Thank you!  I was looking for a config variable to that end, but I was
>> > looking in Spark 1.0.2 documentation, since that was the version I had
>> > the
>> > problem with.  Is this behavior documented in 1.0.2's documentation?
>> >
>> > Evan
>> >
>> > On 10/09/2014 04:12 PM, Davies Liu wrote:
>> >>
>> >> When you call rdd.take() or rdd.first(), it may[1] execute the job
>> >> locally (in the driver);
>> >> otherwise, all the jobs are executed in the cluster.
>> >>
>> >> There is a config called `spark.localExecution.enabled` (since 1.1) to
>> >> change this;
>> >> it's not enabled by default, so all the functions will be executed in the
>> >> cluster.
>> >> If you set this to `true`, then you get the same behavior as
>> >> 1.0.
>> >>
>> >> [1] If it did not get enough items from the first partitions, it will
>> >> try multiple partitions
>> >> at a time, so those will be executed in the cluster.
>> >>
>> >> On Thu, Oct 9, 2014 at 12:14 PM, esamanas 
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0
>> >>> with my app, which will run in yarn-client mode.  However, when I use
>> >>> 'map' to run a Python lambda function over an RDD, the function appears
>> >>> to run on different machines depending on the version, and this is
>> >>> causing problems.
>> >>>
>> >>> In both cases, I am using a Hadoop cluster that runs Linux on all of its
>> >>> nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As
>> >>> a reproducer, here is my script:
>> >>>
>> >>> import platform
>> >>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
>> >>>
>> >>> The answer in Spark 1.1.0:
>> >>> 'Linux'
>> >>>
>> >>> The answer in Spark 1.0.2:
>> >>> 'Darwin'
>> >>>
>> >>> In other experiments I changed the size of the list that gets
>> >>> parallelized,
>> >>> thinking maybe 1.0.2 just runs jobs on the driver node if they're
>> >>> small
>> >>> enough.  I got the same answer (even with 1 million numbers).
>> >>>
>> >>> This is a troubling difference.  I would expect all functions run on
>> >>> an
>> >>> RDD
>> >>> to be executed on my worker nodes in the Hadoop cluster, but this is
>> >>> clearly
>> >>> not the case for 1.0.2.  Why does this difference exist?  How can I
>> >>> accurately detect which jobs will run where?
>> >>>
>> >>> Thank you,
>> >>>
>> >>> Evan
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>>
>> >>> http://apache-spark-user-list.1001560.n3.nabble.com/where-are-my-python-lambda-functions-run-in-yarn-client-mode-tp16059.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> -
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >
>
>
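
For reference, the local-execution behavior discussed above is controlled by
the `spark.localExecution.enabled` flag, which can be set explicitly when the
application is configured. A minimal sketch, assuming Spark 1.1+ and a
hypothetical application name:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app name; the key only has an effect on Spark 1.1 and later.
val conf = new SparkConf()
  .setAppName("local-execution-demo")
  .set("spark.localExecution.enabled", "true")  // opt back in to driver-side take()/first()
val sc = new SparkContext(conf)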



Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread Ilya Ganelin
Because of how closures work in Scala, there is no support for nested
map/RDD-based operations. Specifically, if you have

Context a {
Context b {

}
}

Operations within context b, when distributed across nodes, will no longer
have visibility of variables specific to context a because that context is
not distributed alongside that operation!

To get around this you need to sequence your operations. For example, run
a map job, take the output of that, and run a second map job to filter.
Another option is to run two separate map jobs and join their results. Keep
in mind that another useful technique is the groupByKey routine,
particularly if you want to operate on a particular variable.
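
To make that concrete, a minimal sketch in the spark-shell (the RDDs and field
meanings are hypothetical) of chaining a map and a filter, joining two keyed
RDDs, and grouping by key, instead of nesting one RDD operation inside another:

val orders = sc.parallelize(Seq((1L, 120.0), (2L, 40.0)))          // (orderKey, amount)
val suppliers = sc.parallelize(Seq((1L, "ACME"), (2L, "Globex")))  // (orderKey, supplierName)

// 1) Run one map job, then filter its output in a second pass.
val discounted = orders.map { case (k, amt) => (k, amt * 0.9) }
val large = discounted.filter { case (_, amt) => amt > 100.0 }

// 2) Or run two separate jobs and join their results by key.
val joined = large.join(suppliers)              // RDD[(Long, (Double, String))]

// 3) groupByKey when all values for a key need to be processed together.
val grouped = orders.groupByKey()               // RDD[(Long, Iterable[Double])]
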
On Oct 11, 2014 11:09 AM, "arthur.hk.c...@gmail.com" <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1
> subquery in my Spark SQL, below are my sample table structures and a SQL
> that contains more than 1 subquery.
>
> Question 1:  How to load a HIVE table into Scala/Spark?
> Question 2:  How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY  in
> SCALA/SPARK?
> Question 3:  What is the DATEADD equivalent in Scala/Spark? Or how to
> implement "DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1,
> '2014-01-01')" in Spark or Hive?
> I can find Hive's date_add(string startdate, int days), but it works in days,
> not months/years.
>
> Thanks.
>
> Regards
> Arthur
>
> ===
> My sample SQL with more than 1 subquery:
> SELECT S_NAME,
>COUNT(*) AS NUMWAIT
> FROM   SUPPLIER,
>LINEITEM L1,
>ORDERS
> WHERE  S_SUPPKEY = L1.L_SUPPKEY
>AND O_ORDERKEY = L1.L_ORDERKEY
>AND O_ORDERSTATUS = 'F'
>AND L1.L_RECEIPTDATE > L1.L_COMMITDATE
>AND EXISTS (SELECT *
>FROM   LINEITEM L2
>WHERE  L2.L_ORDERKEY = L1.L_ORDERKEY
>   AND L2.L_SUPPKEY <> L1.L_SUPPKEY)
>AND NOT EXISTS (SELECT *
>FROM   LINEITEM L3
>WHERE  L3.L_ORDERKEY = L1.L_ORDERKEY
>   AND L3.L_SUPPKEY <> L1.L_SUPPKEY
>   AND L3.L_RECEIPTDATE > L3.L_COMMITDATE)
> GROUP  BY S_NAME
> ORDER  BY NUMWAIT DESC, S_NAME
> limit 100;
>
>
> ===
> Supplier Table:
> CREATE TABLE IF NOT EXISTS SUPPLIER (
> S_SUPPKEY INTEGER PRIMARY KEY,
> S_NAME  CHAR(25),
> S_ADDRESS VARCHAR(40),
> S_NATIONKEY BIGINT NOT NULL,
> S_PHONE CHAR(15),
> S_ACCTBAL DECIMAL,
> S_COMMENT VARCHAR(101)
> )
>
> ===
> Order Table:
> CREATE TABLE IF NOT EXISTS ORDERS (
> O_ORDERKEY INTEGER PRIMARY KEY,
> O_CUSTKEY BIGINT NOT NULL,
> O_ORDERSTATUS   CHAR(1),
> O_TOTALPRICEDECIMAL,
> O_ORDERDATE CHAR(10),
> O_ORDERPRIORITY CHAR(15),
> O_CLERK CHAR(15),
> O_SHIPPRIORITY  INTEGER,
> O_COMMENT VARCHAR(79)
> )
>
> ===
> LineItem Table:
> CREATE TABLE IF NOT EXISTS LINEITEM (
> L_ORDERKEY  BIGINT not null,
> L_PARTKEY   BIGINT,
> L_SUPPKEY   BIGINT,
> L_LINENUMBERINTEGER not null,
> L_QUANTITY  DECIMAL,
> L_EXTENDEDPRICE DECIMAL,
> L_DISCOUNT  DECIMAL,
> L_TAX   DECIMAL,
> L_SHIPDATE  CHAR(10),
> L_COMMITDATECHAR(10),
> L_RECEIPTDATE   CHAR(10),
> L_RETURNFLAGCHAR(1),
> L_LINESTATUSCHAR(1),
> L_SHIPINSTRUCT  CHAR(25),
> L_SHIPMODE  CHAR(10),
> L_COMMENT   VARCHAR(44),
> CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER )
> )
>
>


Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this!

Matei

On Oct 11, 2014, at 9:46 AM, Denny Lee  wrote:

> https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
> 
> If you're wondering how to connect Tableau to SparkSQL, here are the steps.
> 
> 
> 
> Enjoy!
> 



Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Evan Samanas
It's true that it is an implementation detail, but it's a very important
one to document because it has the possibility of changing results
depending on when I use take or collect.  The issue I was running into was
when the executor had a different operating system than the driver, and I
was using 'pipe' with a binary I compiled myself.  I needed to make sure I
used the binary compiled for the operating system I expect it to run on.
So in cases where I was only interested in the first value, my code was
breaking horribly on 1.0.2, but working fine on 1.1.

My only suggestion would be to backport 'spark.localExecution.enabled' to
the 1.0 line.  Thanks for all your help!

Evan

On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu  wrote:

> This is an implementation detail, so it's not documented :-(
>
> If you think this is a blocker for you, you could create a JIRA; maybe
> it could be fixed in 1.0.3+.
>
> Davies
>
> On Fri, Oct 10, 2014 at 5:11 PM, Evan  wrote:
> > Thank you!  I was looking for a config variable to that end, but I was
> > looking in Spark 1.0.2 documentation, since that was the version I had
> the
> > problem with.  Is this behavior documented in 1.0.2's documentation?
> >
> > Evan
> >
> > On 10/09/2014 04:12 PM, Davies Liu wrote:
> >>
> >> When you call rdd.take() or rdd.first(), it may[1] execute the job
> >> locally (in the driver);
> >> otherwise, all the jobs are executed in the cluster.
> >>
> >> There is a config called `spark.localExecution.enabled` (since 1.1) to
> >> change this;
> >> it's not enabled by default, so all the functions will be executed in the
> >> cluster.
> >> If you set this to `true`, then you get the same behavior as 1.0.
> >>
> >> [1] If it did not get enough items from the first partitions, it will
> >> try multiple partitions
> >> at a time, so those will be executed in the cluster.
> >>
> >> On Thu, Oct 9, 2014 at 12:14 PM, esamanas 
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0
> >>> with my app, which will run in yarn-client mode.  However, when I use
> >>> 'map' to run a Python lambda function over an RDD, the function appears
> >>> to run on different machines depending on the version, and this is
> >>> causing problems.
> >>>
> >>> In both cases, I am using a Hadoop cluster that runs Linux on all of its
> >>> nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As
> >>> a reproducer, here is my script:
> >>>
> >>> import platform
> >>> print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
> >>>
> >>> The answer in Spark 1.1.0:
> >>> 'Linux'
> >>>
> >>> The answer in Spark 1.0.2:
> >>> 'Darwin'
> >>>
> >>> In other experiments I changed the size of the list that gets
> >>> parallelized,
> >>> thinking maybe 1.0.2 just runs jobs on the driver node if they're small
> >>> enough.  I got the same answer (even with 1 million numbers).
> >>>
> >>> This is a troubling difference.  I would expect all functions run on an
> >>> RDD
> >>> to be executed on my worker nodes in the Hadoop cluster, but this is
> >>> clearly
> >>> not the case for 1.0.2.  Why does this difference exist?  How can I
> >>> accurately detect which jobs will run where?
> >>>
> >>> Thank you,
> >>>
> >>> Evan
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>>
> http://apache-spark-user-list.1001560.n3.nabble.com/where-are-my-python-lambda-functions-run-in-yarn-client-mode-tp16059.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>
> >
>


Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi Spark!

I don't quite understand the semantics of RDDs in a streaming context
very well yet.

Are there any examples of how to implement CustomInputDStreams, with
corresponding Receivers in the docs ?

I've hacked together a custom stream, which is being opened and is
consuming data internally; however, it is producing only empty RDDs, even
though I am calling store(...) multiple times. I'm relying on the default
implementation of store(...), which may be a mistake on my end.

By making my slide duration small, I can make sure that the job indeed
finishes; however, I'm not quite sure how we're supposed to shuttle data
from the ReceiverInputDStream into RDDs.

Thanks!
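
For reference, the usual shape of a custom receiver is to extend Receiver[T],
start a background thread in onStart() that calls store(...) per record, and
wire it up with receiverStream. A minimal sketch, assuming Spark Streaming 1.x
and a hypothetical counter source in place of a real one:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that just emits a counter; replace the loop body with a real source.
class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = {
    new Thread("counter-receiver") {
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store("record-" + i)   // each store() hands one record to Spark Streaming
          i += 1
          Thread.sleep(100)
        }
      }
    }.start()
  }

  def onStop(): Unit = { }       // the polling loop above exits once isStopped() is true
}

// Wiring it up (assuming an existing SparkContext `sc`):
// val ssc = new StreamingContext(sc, Seconds(2))
// val stream = ssc.receiverStream(new CounterReceiver)
// stream.foreachRDD(rdd => println("batch size = " + rdd.count()))
// ssc.start(); ssc.awaitTermination()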


RE: Spark SQL parser bug?

2014-10-11 Thread Mohammed Guller
I tried it even without the “T”, and it still returns an empty result:

scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01 
00:00:00';")
sRdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[35] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
ExistingRdd [a#0,ts#1], MapPartitionsRDD[37] at mapPartitions at 
basicOperators.scala:208

scala> sRdd.collect
res10: Array[org.apache.spark.sql.Row] = Array()


Mohammed

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, October 10, 2014 10:14 PM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: Spark SQL parser bug?


Hmm, there is a “T” in the timestamp string, which makes it not a valid
timestamp string representation. Internally, Spark SQL uses
java.sql.Timestamp.valueOf to cast a string to a timestamp.

On 10/11/14 2:08 AM, Mohammed Guller wrote:
scala> rdd.registerTempTable("x")

scala> val sRdd = sqlContext.sql("select a from x where ts >= 
'2012-01-01T00:00:00';")
sRdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
ExistingRdd [a#0,ts#1], MapPartitionsRDD[6] at mapPartitions at 
basicOperators.scala:208

scala> sRdd.collect
res2: Array[org.apache.spark.sql.Row] = Array()
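
For reference, java.sql.Timestamp.valueOf accepts the JDBC escape format
yyyy-[m]m-[d]d hh:mm:ss[.f...]. A quick Scala check of the two literal forms
used in this thread (this only illustrates the parsing rule; it does not by
itself explain the empty result for the space-separated form):

// Parses: matches the yyyy-mm-dd hh:mm:ss pattern valueOf expects.
val ok = java.sql.Timestamp.valueOf("2012-01-01 00:00:00")

// Throws IllegalArgumentException: the embedded 'T' makes it an invalid timestamp literal.
// java.sql.Timestamp.valueOf("2012-01-01T00:00:00")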


How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread arthur.hk.c...@gmail.com
Hi,

My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1 
subquery in my Spark SQL, below are my sample table structures and a SQL that 
contains more than 1 subquery. 

Question 1:  How to load a HIVE table into Scala/Spark?
Question 2:  How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY  in SCALA/SPARK?
Question 3:  What is the DATEADD equivalent in Scala/Spark? Or how to implement
"DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1, '2014-01-01')" in Spark
or Hive?
I can find Hive's date_add(string startdate, int days), but it works in days,
not months/years.

Thanks.

Regards
Arthur

===
My sample SQL with more than 1 subquery: 
SELECT S_NAME, 
   COUNT(*) AS NUMWAIT 
FROM   SUPPLIER, 
   LINEITEM L1, 
   ORDERS
WHERE  S_SUPPKEY = L1.L_SUPPKEY 
   AND O_ORDERKEY = L1.L_ORDERKEY 
   AND O_ORDERSTATUS = 'F' 
   AND L1.L_RECEIPTDATE > L1.L_COMMITDATE 
   AND EXISTS (SELECT * 
   FROM   LINEITEM L2 
   WHERE  L2.L_ORDERKEY = L1.L_ORDERKEY 
  AND L2.L_SUPPKEY <> L1.L_SUPPKEY) 
   AND NOT EXISTS (SELECT * 
   FROM   LINEITEM L3 
   WHERE  L3.L_ORDERKEY = L1.L_ORDERKEY 
  AND L3.L_SUPPKEY <> L1.L_SUPPKEY 
  AND L3.L_RECEIPTDATE > L3.L_COMMITDATE) 
GROUP  BY S_NAME 
ORDER  BY NUMWAIT DESC, S_NAME
limit 100;  


===
Supplier Table:
CREATE TABLE IF NOT EXISTS SUPPLIER (
S_SUPPKEY   INTEGER PRIMARY KEY,
S_NAME  CHAR(25),
S_ADDRESS   VARCHAR(40),
S_NATIONKEY BIGINT NOT NULL, 
S_PHONE CHAR(15),
S_ACCTBAL   DECIMAL,
S_COMMENT   VARCHAR(101)
) 

===
Order Table:
CREATE TABLE IF NOT EXISTS ORDERS (
O_ORDERKEY  INTEGER PRIMARY KEY,
O_CUSTKEY   BIGINT NOT NULL, 
O_ORDERSTATUS   CHAR(1),
O_TOTALPRICEDECIMAL,
O_ORDERDATE CHAR(10),
O_ORDERPRIORITY CHAR(15),
O_CLERK CHAR(15),
O_SHIPPRIORITY  INTEGER,
O_COMMENT   VARCHAR(79)
)

===
LineItem Table:
CREATE TABLE IF NOT EXISTS LINEITEM (
L_ORDERKEY  BIGINT not null,
L_PARTKEY   BIGINT,
L_SUPPKEY   BIGINT,
L_LINENUMBERINTEGER not null,
L_QUANTITY  DECIMAL,
L_EXTENDEDPRICE DECIMAL,
L_DISCOUNT  DECIMAL,
L_TAX   DECIMAL,
L_SHIPDATE  CHAR(10),
L_COMMITDATECHAR(10),
L_RECEIPTDATE   CHAR(10),
L_RETURNFLAGCHAR(1),
L_LINESTATUSCHAR(1),
L_SHIPINSTRUCT  CHAR(25),
L_SHIPMODE  CHAR(10),
L_COMMENT   VARCHAR(44),
CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER )
)
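
Regarding questions 1 and 3 above, a minimal sketch of one approach in the
spark-shell: a HiveContext for reading an existing Hive table, and a plain
Scala date helper standing in for DATEADD (the helper name and the date format
are illustrative, not built-ins):

import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.sql.hive.HiveContext

// Question 1: a HiveContext reads existing Hive tables directly
// (assumes the spark-shell's `sc` and a Hive-enabled Spark build).
val hiveContext = new HiveContext(sc)
val lateLines = hiveContext.sql(
  "SELECT l_orderkey, l_receiptdate FROM lineitem WHERE l_receiptdate > l_commitdate")

// Question 3: add months or years to a yyyy-MM-dd string outside of SQL.
def dateAdd(date: String, field: Int, amount: Int): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val cal = Calendar.getInstance()
  cal.setTime(fmt.parse(date))
  cal.add(field, amount)
  fmt.format(cal.getTime)
}

dateAdd("2013-07-01", Calendar.MONTH, 3)   // "2013-10-01"
dateAdd("2014-01-01", Calendar.YEAR, 1)    // "2015-01-01"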



Re: how to find the sources for spark-project

2014-10-11 Thread Ted Yu
I found this on computer where I built Spark:

$ jar tvf
/homes/hortonzy/.m2/repository//org/spark-project/hive/hive-exec/0.13.1/hive-exec-0.13.1.jar
| grep ParquetHiveSerDe
  2228 Mon Jun 02 12:50:16 UTC 2014
org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe$1.class
  1442 Mon Jun 02 12:50:16 UTC 2014
org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe$LAST_OPERATION.class
 12786 Mon Jun 02 12:50:16 UTC 2014
org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.class

Looking at
http://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/0.12.0 ,
I don't see the package you mentioned.

Maybe Patrick can give us more clue.

On Sat, Oct 11, 2014 at 7:32 AM, Sadhan Sood  wrote:

>
> -- Forwarded message --
> From: Sadhan Sood 
> Date: Sat, Oct 11, 2014 at 10:26 AM
> Subject: Re: how to find the sources for spark-project
> To: Stephen Boesch 
>
>
> Thanks, I still didn't find it - is it under some particular branch? More
> specifically, I am looking to modify the file: ParquetHiveSerDe.java which
> is under this namespace:
>
> org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java and is
> being linked through this jar:
>
> <dependency>
>   <groupId>org.spark-project.hive</groupId>
>   <artifactId>hive-exec</artifactId>
>   <version>0.12.0</version>
> </dependency>
>
> It seems these are modified versions of the same jar in org.apache.hive:
>
> <dependency>
>   <groupId>org.apache.hive</groupId>
>   <artifactId>hive-exec</artifactId>
>   <version>0.13.1</version>
> </dependency>
>
> It'd be great if I can find the sources of this jar so I can modify it
> locally and rebuild the jar. Not sure if I can rebuild Spark with the
> hive-exec artifact from org.apache.hive without breaking it.
>
>
>
> On Fri, Oct 10, 2014 at 7:28 PM, Stephen Boesch  wrote:
>
>> Git clone the Spark project: https://github.com/apache/spark.git
>> The Hive-related sources are under sql/hive/src/main/scala.
>>
>> 2014-10-10 16:24 GMT-07:00 sadhan :
>>
>>> We have our own customization on top of the Parquet SerDe that we've been
>>> using for Hive. In order to make it work with spark-sql, we need to be able
>>> to re-build Spark with this. It'll be much easier to rebuild Spark with this
>>> patch once I can find the sources for org.spark-project.hive. Not sure where
>>> to find it? This seems like the place where we need to put our patch.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-find-the-sources-for-spark-project-tp16187.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>


Fwd: how to find the sources for spark-project

2014-10-11 Thread Sadhan Sood
-- Forwarded message --
From: Sadhan Sood 
Date: Sat, Oct 11, 2014 at 10:26 AM
Subject: Re: how to find the sources for spark-project
To: Stephen Boesch 


Thanks, I still didn't find it - is it under some particular branch? More
specifically, I am looking to modify the file: ParquetHiveSerDe.java which
is under this namespace:

org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java and is
being linked through this jar:



<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.12.0</version>
</dependency>



It seems these are modified versions of the same jar in org.apache.hive:



<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.13.1</version>
</dependency>



It'd be great if I can find the sources of this jar so I can modify it
locally and rebuild the jar. Not sure if I can rebuild Spark with the
hive-exec artifact from org.apache.hive without breaking it.



On Fri, Oct 10, 2014 at 7:28 PM, Stephen Boesch  wrote:

> Git clone the Spark project: https://github.com/apache/spark.git
> The Hive-related sources are under sql/hive/src/main/scala.
>
> 2014-10-10 16:24 GMT-07:00 sadhan :
>
>> We have our own customization on top of the Parquet SerDe that we've been
>> using for Hive. In order to make it work with spark-sql, we need to be able to
>> re-build Spark with this. It'll be much easier to rebuild Spark with this
>> patch once I can find the sources for org.spark-project.hive. Not sure where
>> to find it? This seems like the place where we need to put our patch.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-find-the-sources-for-spark-project-tp16187.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
Thank you, Sean. I'll try to do it externally as you suggested; however, can
you please give me some hints on how to do that? In fact, where can I find
the 1.2 implementation you just mentioned? Thanks!




On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen  wrote:

> Plain old SVMs don't produce an estimate of class probabilities;
> predict_proba() does some additional work to estimate class
> probabilities from the SVM output. Spark does not implement this right
> now.
>
> Spark implements the equivalent of decision_function (the wTx + b bit)
> but does not expose it, and instead gives you predict(), which gives 0
> or 1 depending on whether the decision function exceeds the specified
> threshold.
>
> Yes you can roll your own just like you did to calculate the decision
> function from weights and intercept. I suppose it would be nice to
> expose it (do I hear a PR?) but it's not hard to do externally. You'll
> have to do this anyway if you're on anything earlier than 1.2.
>
> On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais
>  wrote:
> > OK, let me rephrase my question once again. Python-wise, I prefer
> > .predict_proba(X) instead of .decision_function(X) since it is easier for me
> > to interpret the results. As far as I can see, the latter functionality is
> > already implemented in Spark (well, in version 0.9.2, for example, I have to
> > compute the dot product on my own, otherwise I get 0 or 1), but the former is
> > not implemented (yet!). What should I do / how do I implement that one in
> > Spark as well? What are the required inputs here, and what does the formula
> > look like?
> >
> > On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen  wrote:
> >>
> >> It looks like you are directly computing the SVM decision function in
> >> both cases:
> >>
> >> val predictions2 = m_users_double.map{point=>
> >>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> >> }.cache()
> >>
> >> clf.decision_function(T)
> >>
> >> This does not give you +1/-1 in SVMs (well... not for most points,
> >> which will be outside the margin around the separating hyperplane).
> >>
> >> You can use the predict() function in SVMModel -- which will give you
> >> 0 or 1 (rather than +/- 1 but that's just differing convention)
> >> depending on the sign of the decision function. I don't know if this
> >> was in 0.9.
> >>
> >> At the moment I assume you saw small values of the decision function
> >> in scikit because of the radial basis function.
>
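
To make the "roll your own" option concrete, the decision function w.x + b can
be computed directly from a trained model's weights and intercept, much like
the snippet quoted above. A minimal sketch, assuming an MLlib SVMModel `model`
and an RDD of feature vectors `points` (both hypothetical here):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def margins(points: RDD[Vector], weights: Vector, intercept: Double): RDD[Double] =
  points.map { v =>
    v.toArray.zip(weights.toArray).map { case (x, w) => x * w }.sum + intercept
  }

// val scores = margins(points, model.weights, model.intercept)
// A positive score corresponds to the positive class; larger magnitudes are
// further from the separating hyperplane (up to the norm of the weight vector).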


Re: RDD size in memory - Array[String] vs. case classes

2014-10-11 Thread Sean Owen
Yes, of course. If your number is "123456", then this takes 4 bytes as
an int. But as a String in a 64-bit JVM you have an 8-byte reference,
4-byte object overhead, a 4-byte char count, and 6 2-byte chars.
Maybe more I'm not thinking of.

On Sat, Oct 11, 2014 at 6:29 AM, Liam Clarke-Hutchinson
 wrote:
> Hi all,
>
> I'm playing with Spark currently as a possible solution at work, and I've
> been recently working out a rough correlation between our input data size
> and RAM needed to cache an RDD that will be used multiple times in a job.
>
> As part of this I've been trialling different methods of representing the
> data, and I came across a result that surprised me, so I just wanted to
> check what I was seeing.
>
> So my data set is comprised of CSV with appx. 17 fields. When I load my
> sample data set locally, and cache it after splitting on the comma as an
> RDD[Array[String]], the Spark UI shows 8% of the RDD can be cached in
> available RAM.
>
> When I cache it as an RDD of a case class, 11% of the RDD is cacheable, so
> case classes are actually taking up less serialized space than an array.
>
> Is it because the case class represents numbers as numbers, as opposed to the
> string array keeping them as strings?
>
> Cheers,
>
> Liam Clarke




Re: spark-sql failing for some tables in hive

2014-10-11 Thread Cheng Lian

Hmm, the details of the error didn't show in your mail...

On 10/10/14 12:25 AM, sadhan wrote:

We have a Hive deployment on which we tried running spark-sql. When we try
to do describe  for some of the tables, spark-sql fails with
this:


while it works for some of the other tables. Confused and not sure what's
happening here. The same describe command works in Hive. What's confusing is
that the describe command works for some of the other tables.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-failing-for-some-tables-in-hive-tp16046.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.








Re: Spark SQL - Exception only when using cacheTable

2014-10-11 Thread Cheng Lian
How was the table created? Would you mind sharing the related code? It
seems that the underlying type of the `customer_id` field is actually
long, but the schema says it's integer; basically, it's a type mismatch
error.

The first query succeeds because `SchemaRDD.count()` is translated to
something equivalent to `SELECT COUNT(1) FROM ...` and doesn't actually
touch the field. In the second case, Spark tries to materialize
the whole underlying table into the in-memory columnar format because you
asked to cache the table, and thus the type mismatch is detected.
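
One way to avoid the mismatch is to make the declared schema agree with the
underlying values. A minimal sketch of Spark 1.1's programmatic-schema route in
Scala (the two-column layout and names are hypothetical; the point is declaring
customer_id as LongType when the data actually holds longs):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val raw = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val rowRDD = raw.map { case (custId, name) => Row(custId, name) }

val schema = StructType(Seq(
  StructField("customer_id", LongType, false),   // declared as Long, not Integer
  StructField("name", StringType, true)))

val customers = sqlContext.applySchema(rowRDD, schema)
customers.registerTempTable("customers")
sqlContext.cacheTable("customers")               // columnar caching now sees matching types
customers.count()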


On 10/10/14 8:28 PM, poiuytrez wrote:


Hi Cheng,

I am using Spark 1.1.0.
This is the stack trace:
14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID
2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long
cannot be cast to java.lang.Integer
 scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)

org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:146)

 org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:105)
 org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:92)

org.apache.spark.sql.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:72)

org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$NullableColumnBuilder$super$appendFrom(ColumnBuilder.scala:88)

org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:57)

org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$super$appendFrom(ColumnBuilder.scala:88)

org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:76)

org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:88)

org.apache.spark.sql.columnar.InMemoryRelation$anonfun$1$anon$1.next(InMemoryColumnarTableScan.scala:65)

org.apache.spark.sql.columnar.InMemoryRelation$anonfun$1$anon$1.next(InMemoryColumnarTableScan.scala:50)

org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)

org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)

 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 java.lang.Thread.run(Thread.java:745)


This was also printed on the driver:

14/10/10 12:17:43 ERROR TaskSetManager: Task 120 in stage 7.0 failed 4
times; aborting job
14/10/10 12:17:43 INFO TaskSchedulerImpl: Cancelling stage 7
14/10/10 12:17:43 INFO TaskSchedulerImpl: Stage 7 was cancelled
14/10/10 12:17:43 INFO DAGScheduler: Failed to run collect at
SparkPlan.scala:85
Traceback (most recent call last):
   File "", line 1, in 
   File "/home/hadoop/spark-install/python/pyspark/sql.py", line 1606, in
count
 return self._jschema_rdd.count()
   File
"/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
   File
"/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
120 in stage 7.0 failed 4 times, most recent failure: Lost task 120.3 i

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-11 Thread Sean Owen
I suspect you do not actually need to change the number of partitions
dynamically.

Do you just have groupings of data to process? Use an RDD of (K,V) pairs
and things like groupByKey. If you really have only 1000 unique keys, yes, only
half of the 2000 workers would get data in a phase that groups by the
segmentation key, I suppose. I'd then ask: a) do you really need 2000 workers,
or can that be tuned down? b) can you not divide up the day more than 1000
ways? c) does it matter so much in this phase, if you can exploit 2000-way
parallelism elsewhere in the pipeline?
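
As a concrete version of the (K,V) suggestion, keying each record by its
segmentation field and grouping gives one group per unique key, so window logic
never crosses a key boundary. A minimal sketch with hypothetical field
positions, assuming the spark-shell's `sc`:

// Hypothetical CSV where column 0 is the segmentation key and column 1 a timestamp.
val records = sc.textFile("events.csv")

val keyed = records.map { line =>
  val f = line.split(",")
  (f(0), line)                      // (segmentationKey, full record)
}

// All records for one key land in the same group; sliding-window logic can then
// run inside each group, ordered by timestamp, without crossing key boundaries.
val perSegment = keyed.groupByKey().mapValues(_.toSeq.sortBy(_.split(",")(1)))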

On Sat, Oct 11, 2014 at 3:42 AM, nitinkak001  wrote:

> Thanks @category_theory, the post was of great help!!
>
> I had to learn a few things before I could understand it completely.
> However, I am facing the issue of partitioning the data (using partitionBy)
> without providing a hardcoded value for the number of partitions. The
> partitions need to be driven by the data (the segmentation key I am using) in
> my case.
>
> So my question is: say
>
> the number of partitions generated by my segmentation key = 1000
> the number given to the partitioner = 2000
>
> In this case, would there be 2000 partitions created (which will break the
> partition boundaries of the segmentation key)? If so, the sliding window will
> roll over multiple partitions and the computation would generate wrong results.
>
> Thanks again for the response!!
>
> On Tue, Sep 30, 2014 at 11:51 AM, category_theory [via Apache Spark User
> List] wrote:
>
>> Not sure if this is what you are after, but it's based on a moving average
>> within Spark...  I was building an ARIMA model on top of Spark and this
>> helped me out a lot:
>>
>> http://stackoverflow.com/questions/23402303/apache-spark-moving-average
>>
>> On Tue, Sep 30, 2014 at 8:19 AM, nitinkak001 wrote:
>>
>>> Any ideas guys?
>>>
>>> Trying to find some information online. Not much luck so far.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p15404.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>>
>
>