HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the
code

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")

conf.set("spark.executor.extraClassPath",
"/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.driver.extraClassPath",
"/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.yarn.am.waitTime", "30L")

val sc = new SparkContext(conf)

val sqlContext = new HiveContext(sc)

def inputRDD = sqlContext.sql("describe spark_poc.src_digital_profile_user")

inputRDD.collect().foreach { println }

println(inputRDD.schema.getClass.getName)

Getting this exception. Any clues? The weird part is if I try to do the same
thing but in Java instead of Scala, it runs fine.

Exception in thread "Driver" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
15/03/06 17:39:32 ERROR yarn.ApplicationMaster: SparkContext did not
initialize after waiting for 10000 ms. Please check earlier log output for
errors. Failing the application.
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:218)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:434)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:433)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
15/03/06 17:39:32 INFO yarn.ApplicationMaster: AppMaster received a signal.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-test-Spark-Context-did-not-initialize-after-waiting-1ms-tp21953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread Marcelo Vanzin
On Fri, Mar 6, 2015 at 2:47 PM, nitinkak001 nitinkak...@gmail.com wrote:
 I am trying to run a Hive query from Spark using HiveContext. Here is the
 code

 val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")

 conf.set("spark.executor.extraClassPath",
 "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
 conf.set("spark.driver.extraClassPath",
 "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
 conf.set("spark.yarn.am.waitTime", "30L")

You're missing /* at the end of your classpath entries. Also, since
you're on CDH 5.2, you'll probably need to filter out the guava jar
from Hive's lib directory, otherwise things might break. So things
will get a little more complicated.

With CDH 5.3 you shouldn't need to filter out the guava jar.
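
(For reference, a minimal sketch of the corrected settings described above, using the paths from the original post; on CDH 5.2 the guava jar under that directory may still need to be filtered out, as noted.)

// Sketch only: append /* so the jars inside the directory are picked up.
val hiveLib = "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/*"
val conf = new SparkConf()
  .setAppName("HiveSparkIntegrationTest")
  .set("spark.executor.extraClassPath", hiveLib)
  .set("spark.driver.extraClassPath", hiveLib)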

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
i added it

On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi Koert,

 Would you like to register this on spark-packages.org?

 Burak

 On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:

 currently spark provides many excellent algorithms for operations per key
 as long as the data sent to the reducers per key fits in memory. operations
 like combineByKey, reduceByKey and foldByKey rely on pushing the operation
 map-side so that the data reduce-side is small. and groupByKey simply
 requires that the values per key fit in memory.

 but there are algorithms for which we would like to process all the
 values per key reduce-side, even when they do not fit in memory. examples
 are algorithms that need to process the values ordered, or algorithms that
 need to emit all values again. basically this is what the original hadoop
 reduce operation did so well: it allowed sorting of values (using secondary
 sort), and it processed all values per key in a streaming fashion.

 the library spark-sorted aims to bring these kind of operations back to
 spark, by providing a way to process values with a user provided
 Ordering[V] and a user provided streaming operation Iterator[V] =>
 Iterator[W]. it does not make the assumption that the values need to fit in
 memory per key.

 the basic idea is to rely on spark's sort-based shuffle to re-arrange the
 data so that all values for a given key are placed consecutively within a
 single partition, and then process them using a map-like operation.

 you can find the project here:
 https://github.com/tresata/spark-sorted

 the project is in a very early stage. any feedback is very much
 appreciated.







Spark streaming and executor object reusage

2015-03-06 Thread Jean-Pascal Billaud
Hi,

Reading through the Spark Streaming Programming Guide, I read in the
Design Patterns for using foreachRDD:

Finally, this can be further optimized by reusing connection objects
across multiple RDDs/batches.
One can maintain a static pool of connection objects that can be reused as
RDDs of multiple batches are pushed to the external system

I have a connection pool that may be more or less heavy to
instantiate. I don't use it as part of a foreachRDD but as part of regular
map operations to query some API service. I'd like to understand what
"multiple batches" means here. Is this across RDDs on a single DStream?
Across multiple DStreams?

I'd also like to understand the context shareability across DStreams over
time. Is it expected that the executor initializing my factory will keep
getting batches from my streaming job while using the same singleton
connection pool over and over? Or does Spark reset executor state after each
DStream completes, potentially to allocate executors to other streaming
jobs?
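
(For what it's worth, here is a minimal sketch of the per-executor singleton pattern the guide describes; ApiClientPool and Client are hypothetical names, and `lines` stands for the input DStream.)

// A lazily initialized singleton: created once per executor JVM and then
// reused by every task and batch scheduled on that executor.
object ApiClientPool {
  // "Client" is a hypothetical stand-in for a real connection pool / HTTP client.
  class Client(endpoint: String) { def query(record: String): String = record }
  lazy val client = new Client("http://api.example.com") // hypothetical endpoint
}

// Used from a regular map over the DStream: the client is not re-created per
// batch, only once per executor process.
val enriched = lines.map(record => ApiClientPool.client.query(record))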

Thanks,


Re: Spark code development practice

2015-03-06 Thread Sean Owen
Hm, why do you expect a factory method over a constructor? No, you
instantiate a SparkContext yourself (if not working in the shell).

When you write your own program, you parse your own command line args.
--master yarn-client doesn't do anything unless you make it do so;
that is an arg to *Spark* programs.

In many cases these args just end up setting a value on SparkConf, though
not always; you would need to trace through programs like SparkSubmit to
see how to emulate the effect of some args.

Unless you really need to do this, you should not. Write an app that
you run with spark-submit.
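
(To illustrate the point about SparkConf, a minimal sketch of setting the master in code rather than on the command line; the app name and master value are placeholders.)

// Roughly what --master ends up doing: setting a value on SparkConf.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("yarn-client") // or "local[2]" while developing locally
val sc = new SparkContext(conf)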

On Fri, Mar 6, 2015 at 2:56 AM, Xi Shen davidshe...@gmail.com wrote:
 Thanks guys, this is very useful :)

 @Stephen, I know spark-shell will create a SC for me. But I don't understand
 why we still need to do new SparkContext(...) in our code. Shouldn't we get
 it from somewhere, e.g. SparkContext.get?

 Another question: if I want my spark code to run in YARN later, how should I
 create the SparkContext? Or can I just specify --master yarn on the command
 line?


 Thanks,
 David


 On Fri, Mar 6, 2015 at 12:38 PM Koen Vantomme koen.vanto...@gmail.com
 wrote:

 use the spark-shell command and the shell will open
 type :paste and then paste your code, then press Ctrl-D

 to open spark-shell:
 spark/bin
 ./spark-shell

 Verstuurd vanaf mijn iPhone

 Op 6-mrt.-2015 om 02:28 heeft fightf...@163.com fightf...@163.com het
 volgende geschreven:

 Hi,

 You can first set up a Scala IDE to develop and debug your Spark
 program, say, IntelliJ IDEA or Eclipse.

 Thanks,
 Sun.

 
 fightf...@163.com


 From: Xi Shen
 Date: 2015-03-06 09:19
 To: user@spark.apache.org
 Subject: Spark code development practice
 Hi,

 I am new to Spark. I see every spark program has a main() function. I
 wonder if I can run the spark program directly, without using spark-submit.
 I think it will be easier for early development and debugging.


 Thanks,
 David



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread fenghaixiong
you can read this document:
http://spark.apache.org/docs/latest/building-spark.html
It might solve your question.
Also, if you compile spark with maven you might need to set the maven options like this
before you start compiling:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

thanks
On Fri, Mar 06, 2015 at 07:10:34PM +1100, Night Wolf wrote:
 Tried with that. No luck. Same error on the sbt-interface jar. I can see maven
 downloaded that jar into my .m2 cache
 
 On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:
 
  try it with mvn  -DskipTests -Pscala-2.11 clean install package

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Sean Owen
Are you letting Spark download and run zinc for you? Maybe that copy
is incomplete or corrupted. You can try removing the downloaded zinc
from build/ and try again.

Or run your own zinc.

On Fri, Mar 6, 2015 at 7:51 AM, Night Wolf nightwolf...@gmail.com wrote:
 Hey,

 Trying to build latest spark 1.3 with Maven using

 -DskipTests clean install package

 But I'm getting errors with zinc, in the logs I see;

 [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-network-common_2.11 ---


 ...

 [error] Required file not found: sbt-interface.jar
 [error] See zinc -help for information about locating necessary files


 Any ideas?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Integer column in schema RDD from parquet being considered as string

2015-03-06 Thread gtinside
Hi tsingfu ,

Thanks for your reply. I tried with other columns but the problem is the same
with other Integer columns.

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Integer-column-in-schema-RDD-from-parquet-being-considered-as-string-tp21917p21950.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Found this thread:
http://search-hadoop.com/m/JW1q5HMrge2

Cheers

On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

 This was discussed in the past and viewed as dangerous to enable. The
  biggest problem, by far, comes when you have a job that outputs M
  partitions, 'overwriting' a directory of data containing N > M old
  partitions. You suddenly have a mix of new and old data.

 It doesn't match Hadoop's semantics either, which won't let you do
 this. You can of course simply remove the output directory.

 On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:
  Adding support for overwrite flag would make saveAsXXFile more user
 friendly.
 
  Cheers
 
 
 
  On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
 
  Hi folks,
 
  I found that RDD:saveXXFile has no overwrite flag which I think is very
 helpful. Is there any reason for this ?
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Nan Zhu
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable
output validation for the whole program,

the Spark implementation uses a DynamicVariable (in object PairRDDFunctions)
internally to disable it in a case-by-case manner:

val disableOutputSpecValidation: DynamicVariable[Boolean] = new 
DynamicVariable[Boolean](false)

I’m not sure there is enough benefit to make it worth exposing
this variable to the user…

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

 Found this thread:
 http://search-hadoop.com/m/JW1q5HMrge2
  
 Cheers
  
 On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  This was discussed in the past and viewed as dangerous to enable. The
  biggest problem, by far, comes when you have a job that outputs M
  partitions, 'overwriting' a directory of data containing N > M old
  partitions. You suddenly have a mix of new and old data.
   
  It doesn't match Hadoop's semantics either, which won't let you do
  this. You can of course simply remove the output directory.
   
  On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com 
  (mailto:yuzhih...@gmail.com) wrote:
   Adding support for overwrite flag would make saveAsXXFile more user 
   friendly.
  
   Cheers
  
  
  
   On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com 
   (mailto:zjf...@gmail.com) wrote:
  
   Hi folks,
  
   I found that RDD:saveXXFile has no overwrite flag which I think is very 
   helpful. Is there any reason for this ?
  
  
  
   --
   Best Regards
  
   Jeff Zhang
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)
  
  



Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-06 Thread Zsolt Tóth
Hi,

I submit spark jobs in yarn-cluster mode remotely from java code by calling
Client.submitApplication(). For some reason I want to use 1.3.0 jars on the
client side (e.g spark-yarn_2.10-1.3.0.jar) but I have
spark-assembly-1.2.1* on the cluster.
The problem is that the ApplicationMaster can't find the user application
jar (specified with --jar option). I think this is because of changes in
the classpath population in the Client class.
Is it possible to make this setup work without changing the codebase or the
jars?

Cheers,
Zsolt


Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Edmon Begoli
 Does Spark-SQL require installation of Hive for it to run correctly or not?

I could not tell from this statement:
https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

Thank you,
Edmon


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Sean Owen
This was discussed in the past and viewed as dangerous to enable. The
biggest problem, by far, comes when you have a job that outputs M
partitions, 'overwriting' a directory of data containing N > M old
partitions. You suddenly have a mix of new and old data.

It doesn't match Hadoop's semantics either, which won't let you do
this. You can of course simply remove the output directory.
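
(A minimal sketch of the two options mentioned in this thread; the output path is hypothetical.)

import org.apache.hadoop.fs.Path

// Option 1: remove the existing output directory yourself before saving.
val out = new Path("hdfs:///tmp/my-output") // hypothetical path
out.getFileSystem(sc.hadoopConfiguration).delete(out, true) // recursive delete
rdd.saveAsTextFile(out.toString)

// Option 2: turn off the "output directory already exists" check globally,
// keeping in mind the old/new partition mixing caveat described above:
// sparkConf.set("spark.hadoop.validateOutputSpecs", "false")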

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:
 Adding support for overwrite flag would make saveAsXXFile more user friendly.

 Cheers



 On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

 Hi folks,

 I found that RDD:saveXXFile has no overwrite flag which I think is very 
 helpful. Is there any reason for this ?



 --
 Best Regards

 Jeff Zhang

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Adding support for overwrite flag would make saveAsXXFile more user friendly. 

Cheers



 On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
 
 Hi folks,
 
 I found that RDD:saveXXFile has no overwrite flag which I think is very 
 helpful. Is there any reason for this ?
 
 
 
 -- 
 Best Regards
 
 Jeff Zhang

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Data Frame types

2015-03-06 Thread Cesar Flores
The SchemaRDD supports the storage of user defined classes. However, in
order to do that, the user class needs to extend the UserDefinedType interface
(see for example VectorUDT in org.apache.spark.mllib.linalg).

My question is: will the new DataFrame structure (to be released in Spark
1.3) be able to handle user defined classes too? Will user classes still
need to extend UserDefinedType, i.e. follow the same approach?


-- 
Cesar Flores


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon,

No, you do not need to install Hive to use Spark SQL.

Thanks,

Yin

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon



[SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread James
Hello,

I want to execute a hql script through `spark-sql` command, my script
contains:

```
ALTER TABLE xxx
DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
```

when I execute

```
spark-sql -f script.hql -hiveconf CUR_DATE=20150119
```

It throws an error like
```
cannot recognize input near '$' '{' 'hiveconf' in constant
```

I have tried it on hive and it works. So how could I pass a parameter like a
date to a hql script?

Alcaid


Re: LBGFS optimizer performace

2015-03-06 Thread Gustavo Enrique Salazar Torres
Hi there:

Yeah, I came to that same conclusion after tuning spark sql shuffle
parameter. Also cut out some classes I was using to parse my dataset and
finally created schema only with the fields needed for my model (before
that I was creating it with 63 fields while I just needed 15).
So I came with this set of parameters:

--num-executors 200
--executor-memory 800M
--conf spark.executor.extraJavaOptions=-XX:+UseCompressedOops
-XX:+AggressiveOpts
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.storage.memoryFraction=0.3
--conf spark.rdd.compress=true
--conf spark.sql.shuffle.partitions=4000
--driver-memory 4G

Now I processed 270 GB in 35 minutes and no OOM errors.
I have one question though: Does Spark SQL handle skewed tables? I was
wondering about that since my data has that feature and maybe there is more
room for performance improvement.

Thanks again.

Gustavo


On Thu, Mar 5, 2015 at 6:45 PM, DB Tsai dbt...@dbtsai.com wrote:

 PS, I would recommend compressing the data when you cache the RDD.
 There will be some overhead in compression/decompression and
 serialization/deserialization, but it helps a lot for iterative
 algorithms by making it possible to cache more data.
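
(A minimal sketch of that suggestion; `trainingData` stands for whatever RDD feeds LBFGS.)

import org.apache.spark.storage.StorageLevel

// Before creating the SparkContext: compress serialized RDD blocks.
// sparkConf.set("spark.rdd.compress", "true")

// Cache the training data in serialized (and hence compressed) form,
// trading some CPU for a much smaller memory footprint.
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)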

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Tue, Mar 3, 2015 at 2:27 PM, Gustavo Enrique Salazar Torres
 gsala...@ime.usp.br wrote:
  Yeah, I can call count before that and it works. Also I was over caching
  tables but I removed those. Now there is no caching but it gets really
 slow
  since it calculates my table RDD many times.
  Also hacked the LBFGS code to pass the number of examples which I
 calculated
  outside in a Spark SQL query but just moved the location of the problem.
 
  The query I'm running looks like this:
 
  sSELECT $mappedFields, tableB.id as b_id FROM tableA LEFT JOIN tableB
 ON
  tableA.id=tableB.table_a_id WHERE tableA.status='' OR tableB.status='' 
 
  mappedFields contains a list of fields which I'm interested in. The
 result
  of that query goes through (including sampling) some transformations
 before
  being input to LBFGS.
 
  My dataset has 180GB just for feature selection, I'm planning to use
 450GB
  to train the final model and I'm using 16 c3.2xlarge EC2 instances, that
  means I have 240GB of RAM available.
 
  Any suggestion? I'm starting to check the algorithm because I don't
  understand why it needs to count the dataset.
 
  Thanks
 
  Gustavo
 
  On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley jos...@databricks.com
  wrote:
 
  Is that error actually occurring in LBFGS?  It looks like it might be
  happening before the data even gets to LBFGS.  (Perhaps the outer join
  you're trying to do is making the dataset size explode a bit.)  Are you
 able
  to call count() (or any RDD action) on the data before you pass it to
 LBFGS?
 
  On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres
  gsala...@ime.usp.br wrote:
 
  Just did with the same error.
  I think the problem is the data.count() call in LBFGS because for
 huge
  datasets that's naive to do.
  I was thinking to write my version of LBFGS but instead of doing
  data.count() I will pass that parameter which I will calculate from a
 Spark
  SQL query.
 
  I will let you know.
 
  Thanks
 
 
  On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:
 
  Can you try increasing your driver memory, reducing the executors and
  increasing the executor memory?
 
  Thanks
  Best Regards
 
  On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres
  gsala...@ime.usp.br wrote:
 
  Hi there:
 
  I'm using LBFGS optimizer to train a logistic regression model. The
  code I implemented follows the pattern showed in
  https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but
 training
  data is obtained from a Spark SQL RDD.
  The problem I'm having is that LBFGS tries to count the elements in
 my
  RDD and that results in a OOM exception since my dataset is huge.
  I'm running on a AWS EMR cluster with 16 c3.2xlarge instances on
 Hadoop
  YARN. My dataset is about 150 GB but I sample (I take only 1% of the
 data)
  it in order to scale logistic regression.
  The exception I'm getting is this:
 
  15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in
  stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal):
  java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:2694)
  at java.lang.String.init(String.java:203)
  at
  com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
  at
 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
  at
 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
  at
  com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
  at
 
 

Re: Optimizing SQL Query

2015-03-06 Thread daniel queiroz
Dude,

please, attach the execution plan of the query and details about the
indexes.
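
(To get that plan out of Spark SQL itself, something like the following sketch usually works; the SELECT is abbreviated here.)

// Print Spark SQL's analyzed and physical plans for the query.
val result = sqlContext.sql("SELECT ...") // the full query from the post below
println(result.queryExecution)

// With a HiveContext, an EXPLAIN statement can also be run directly:
sqlContext.sql("EXPLAIN EXTENDED SELECT ...").collect().foreach(println)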



2015-03-06 9:07 GMT-03:00 anu anamika.guo...@gmail.com:

 I have a query that's like:

 Could you help in providing me pointers as to how to start to optimize it
 w.r.t. spark sql:


 sqlContext.sql(

 SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE

 FROM (
SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS
 SDP_USAGE

FROM (
  SELECT *

  FROM date_d dd JOIN interval_f intf

  ON intf.DATE_WID = dd.WID

  WHERE intf.DATE_WID >= 20141116 AND
 intf.DATE_WID <= 20141125 AND CAST(INTERVAL_END_TIME AS STRING) >=
 '2014-11-16 00:00:00.000' AND CAST(INTERVAL_END_TIME
AS STRING) <= '2014-11-26
 00:00:00.000' AND MEAS_WID = 3

   ) test JOIN sdp_d sdp

ON test.SDP_WID = sdp.WID

WHERE sdp.UDC_ID = 'SP-1931201848'

GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID

) dw

 GROUP BY dw.DAY_OF_WEEK, dw.HOUR)



 Currently the query takes 15 minutes execution time where interval_f table
 holds approx 170GB of raw data, date_d -- 170 MB and sdp_d -- 490MB



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Do you have a reference paper to the implemented algorithm in TSQR.scala ?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 There are a couple of solvers that I've written that are part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLlib yet though, and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can use
 out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao





spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
currently spark provides many excellent algorithms for operations per key
as long as the data sent to the reducers per key fits in memory. operations
like combineByKey, reduceByKey and foldByKey rely on pushing the operation
map-side so that the data reduce-side is small. and groupByKey simply
requires that the values per key fit in memory.

but there are algorithms for which we would like to process all the values
per key reduce-side, even when they do not fit in memory. examples are
algorithms that need to process the values ordered, or algorithms that need
to emit all values again. basically this is what the original hadoop reduce
operation did so well: it allowed sorting of values (using secondary sort),
and it processed all values per key in a streaming fashion.

the library spark-sorted aims to bring these kind of operations back to
spark, by providing a way to process values with a user provided
Ordering[V] and a user provided streaming operation Iterator[V] =>
Iterator[W]. it does not make the assumption that the values need to fit in
memory per key.

the basic idea is to rely on spark's sort-based shuffle to re-arrange the
data so that all values for a given key are placed consecutively within a
single partition, and then process them using a map-like operation.
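
(For readers who want the flavor of this without the library, here is a minimal sketch of the same idea using only core Spark APIs, assuming `data` is an RDD[(String, Int)] of (key, value) pairs; it is not the spark-sorted API itself.)

import org.apache.spark.SparkContext._
import org.apache.spark.{HashPartitioner, Partitioner}

// Move the value into the key so the shuffle sorts by (key, value)...
val keyed = data.map { case (k, v) => ((k, v), ()) }

// ...but partition by the original key only, so every value of a key
// lands in the same partition.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) => delegate.getPartition(k)
  }
}

val processed = keyed
  .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(8))
  .mapPartitions { it =>
    // Values for each key now arrive consecutively and sorted, so they can
    // be processed in a single streaming pass without buffering per key.
    it.map { case ((k, v), _) => (k, v) }
  }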

you can find the project here:
https://github.com/tresata/spark-sorted

the project is in a very early stage. any feedback is very much appreciated.


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Since we already have spark.hadoop.validateOutputSpecs config, I think
there is not much need to expose disableOutputSpecValidation

Cheers

On Fri, Mar 6, 2015 at 7:34 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

  Actually, except setting spark.hadoop.validateOutputSpecs to false to
 disable output validation for the whole program

 Spark implementation uses a Dynamic Variable (object PairRDDFunctions)
 internally to disable it in a case-by-case manner

 val disableOutputSpecValidation: DynamicVariable[Boolean] = new 
 DynamicVariable[Boolean](false)


 I’m not sure if there is enough amount of benefits to make it worth exposing 
 this variable to the user…


 Best,


 --
 Nan Zhu
 http://codingcat.me

 On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

 Found this thread:
 http://search-hadoop.com/m/JW1q5HMrge2

 Cheers

 On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

 This was discussed in the past and viewed as dangerous to enable. The
 biggest problem, by far, comes when you have a job that outputs M
 partitions, 'overwriting' a directory of data containing N > M old
 partitions. You suddenly have a mix of new and old data.

 It doesn't match Hadoop's semantics either, which won't let you do
 this. You can of course simply remove the output directory.

 On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:
  Adding support for overwrite flag would make saveAsXXFile more user
 friendly.
 
  Cheers
 
 
 
  On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
 
  Hi folks,
 
  I found that RDD:saveXXFile has no overwrite flag which I think is very
 helpful. Is there any reason for this ?
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 






Re: Data Frame types

2015-03-06 Thread Jaonary Rabarisoa
Hi Cesar,

Yes, you can define a UDT with the new DataFrame, the same way that you
did with SchemaRDD.

Jaonary

On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote:


 The SchemaRDD supports the storage of user defined classes. However, in
 order to do that, the user class needs to extend the UserDefinedType
 interface (see for example VectorUDT in org.apache.spark.mllib.linalg).

 My question is: will the new DataFrame structure (to be released in Spark
 1.3) be able to handle user defined classes too? Will user classes still
 need to extend UserDefinedType, i.e. follow the same approach?


 --
 Cesar Flores



Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Hi Shivaram,

Thank you for the link. I'm trying to figure out how I can port this to
mllib. Maybe you can help me understand how the pieces fit together.
Currently, in mllib there are different types of distributed matrix:
BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one
should correspond to RowPartitionedMatrix in ml-matrix?



On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 There are couple of solvers that I've written that is part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can use
 out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao





Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-06 Thread Sean Owen
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile.

Why are you running install package and not just install? Probably
doesn't matter.

This sounds like you are trying to only build core without building
everything else, which you can't do in general unless you already
built and installed these snapshot artifacts locally.

On Fri, Mar 6, 2015 at 12:46 AM, Night Wolf nightwolf...@gmail.com wrote:
 Hey guys,

 Trying to build Spark 1.3 for Scala 2.11.

 I'm running with the folllowng Maven command;

 -DskipTests -Dscala-2.11 clean install package


 Exception:

 [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
 dependencies for project
 org.apache.spark:spark-core_2.10:jar:1.3.0-SNAPSHOT: The following artifacts
 could not be resolved:
 org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT,
 org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT: Failure to
 find org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT in
 http://repository.apache.org/snapshots was cached in the local repository,
 resolution will not be reattempted until the update interval of
 apache.snapshots has elapsed or updates are forced - [Help 1]


 I see these warnings in the log before this error:


 [INFO]
 [INFO]
 
 [INFO] Building Spark Project Core 1.3.0-SNAPSHOT
 [INFO]
 
 [WARNING] The POM for
 org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT is missing, no
 dependency information available
 [WARNING] The POM for
 org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT is missing,
 no dependency information available


 Any ideas?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Optimizing SQL Query

2015-03-06 Thread anu
I have a query that's like:

Could you help in providing me pointers as to how to start to optimize it
w.r.t. spark sql:


sqlContext.sql(

SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE

FROM (
   SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS
SDP_USAGE

   FROM (
 SELECT *

 FROM date_d dd JOIN interval_f intf

 ON intf.DATE_WID = dd.WID

  WHERE intf.DATE_WID >= 20141116 AND
intf.DATE_WID <= 20141125 AND CAST(INTERVAL_END_TIME AS STRING) >=
'2014-11-16 00:00:00.000' AND CAST(INTERVAL_END_TIME
   AS STRING) <= '2014-11-26
00:00:00.000' AND MEAS_WID = 3

  ) test JOIN sdp_d sdp

   ON test.SDP_WID = sdp.WID

   WHERE sdp.UDC_ID = 'SP-1931201848'

   GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID

   ) dw

GROUP BY dw.DAY_OF_WEEK, dw.HOUR)



Currently the query takes 15 minutes execution time where interval_f table
holds approx 170GB of raw data, date_d -- 170 MB and sdp_d -- 490MB 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-stream programme failed on yarn-client

2015-03-06 Thread fenghaixiong
Thanks, your advice is useful. I had submitted my job from my spark client,
which is configured with a simple configuration file, so it failed.
When I run my job on the service machine everything is okay.

 
On Fri, Mar 06, 2015 at 02:10:04PM +0530, Akhil Das wrote:
 Looks like an issue with your yarn setup, could you try doing a simple
 example with spark-shell?
 
 Start the spark shell as:
 
 $ MASTER=yarn-client bin/spark-shell
 spark-shell> sc.parallelize(1 to 1000).collect
 
 ​If that doesn't work, then make sure your yarn services are up and running
 and in your spark-env.sh you may set the corresponding configurations from
 the following:​
 
 
 # Options read in YARN client mode
 # - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
 # - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
 # - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
 # - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
 # - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512
 Mb)
 
 
 Thanks
 Best Regards
 
 On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong 980548...@qq.com wrote:
 
  Hi all,
 
 
  I'm trying to write a spark stream programme, so I read the spark
  online documentation. According to the document I wrote the program like this:
 
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming._
  import org.apache.spark.streaming.StreamingContext._
 
  object SparkStreamTest {
def main(args: Array[String]) {
  val conf = new SparkConf()
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream(args(0), args(1).toInt)
   val words = lines.flatMap(_.split(" "))
   val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)
  wordCounts.print()
  ssc.start() // Start the computation
  ssc.awaitTermination()  // Wait for the computation to terminate
}
 
  }
 
 
 
  For testing, I first start listening on a port with:
   nc -lk 

  and then I submit the job with:
  spark-submit --master local[2] --class com.nd.hxf.SparkStreamTest
  spark-test-tream-1.0-SNAPSHOT-job.jar localhost 
 
  everything is okay
 
 
  but when i run it on yarn by this :
  spark-submit  --master yarn-client --class com.nd.hxf.SparkStreamTest
  spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 
 
  it waits for a long time and repeatedly outputs some messages; a part of the
  output looks like this:
 
 
 
 
 
 
 
  15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is
  ready for scheduling beginning after waiting
  maxRegisteredResourcesWaitingTime: 3(ms)
  15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
  15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO SocketInputDStream: Storage level =
  StorageLevel(false, false, false, false, 1)
  15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated
  org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
  15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO FlatMappedDStream: Storage level =
  StorageLevel(false, false, false, false, 1)
  15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated
  org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
  15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false,
  false, false, false, 1)
  15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO MappedDStream: Initialized and validated
  org.apache.spark.streaming.dstream.MappedDStream@941451f
  15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO ShuffledDStream: Storage level =
  StorageLevel(false, false, false, false, 1)
  15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated
  org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee
  15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false,
  false, false, false, 1)
  15/03/06 

No overwrite flag for saveAsXXFile

2015-03-06 Thread Jeff Zhang
Hi folks,

I found that RDD:saveXXFile has no overwrite flag, which I think would be very
helpful. Is there any reason for this?



-- 
Best Regards

Jeff Zhang


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
It's not required, but even if you don't have Hive installed you probably
still want to use the HiveContext.  From earlier in that doc:

In addition to the basic SQLContext, you can also create a HiveContext,
 which provides a superset of the functionality provided by the basic
 SQLContext. Additional features include the ability to write queries using
 the more complete HiveQL parser, access to HiveUDFs, and the ability to
 read data from Hive tables. To use a HiveContext, *you do not need to
 have an existing Hive setup*, and all of the data sources available to a
 SQLContext are still available. HiveContext is only packaged separately to
 avoid including all of Hive’s dependencies in the default Spark build. If
 these dependencies are not a problem for your application then using
 HiveContext is recommended for the 1.2 release of Spark. Future releases
 will focus on bringing SQLContext up to feature parity with a HiveContext.
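
(A minimal sketch of that in practice, with no Hive installation present; the table is just an example.)

import org.apache.spark.sql.hive.HiveContext

// No existing Hive setup is required; a local metastore is created
// automatically in the working directory.
val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT COUNT(*) FROM src").collect().foreach(println)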


On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

 Hi Edmon,

 No, you do not need to install Hive to use Spark SQL.

 Thanks,

 Yin

 On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon





Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
First, thanks to everyone for their assistance and recommendations.

@Marcelo

I applied the patch that you recommended and am now able to get into the
shell. Thank you, it worked great once I realized that the pom was pointing to
the 1.3.0-SNAPSHOT for the parent and needed to be bumped down to 1.2.1.

@Zhan

Need to apply this patch next. I tried to start the spark-thriftserver; it
starts, then fails like this (I have the entries in my spark-defaults.conf,
but not the patch applied):

./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m
--hiveconf hive.server2.thrift.port=10001

5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050
15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM
15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar -> hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container
15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018
15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:22 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:23 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:24 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:25 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:26 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hadoopdev02.opsdatastore.com, PROXY_URI_BASES -> http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018), /proxy/application_1425078697953_0018
15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/03/06 12:34:27 INFO yarn.Client: Application report for application_1425078697953_0018 (state: RUNNING)
15/03/06 12:34:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: hadoopdev08.opsdatastore.com
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Application application_1425078697953_0018 has started running.
15/03/06 12:34:28 INFO netty.NettyBlockTransferService: Server created on 46124
15/03/06 12:34:28 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/03/06 12:34:28 INFO storage.BlockManagerMasterActor: Registering block manager hadoopdev01.opsdatastore.com:46124 with 265.4 MB RAM, BlockManagerId(driver, hadoopdev01.opsdatastore.com, 46124)
15/03/06 12:34:28 INFO storage.BlockManagerMaster: Registered

SparkSQL supports hive insert overwrite directory?

2015-03-06 Thread ogoh
Hello,
I am using Spark 1.2.1 along with Hive 0.13.1.
I run some hive queries using beeline and the Thriftserver.
Queries I tested so far worked well except the following:
I want to export the query output into a file on either HDFS or the local fs
(ideally the local fs).
Is that not yet supported?
The spark github has already unit tests using insert overwrite directory
in
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/insert_overwrite_local_directory_1.q.

$insert overwrite directory 'hdfs directory name' select * from temptable; 
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
'/user/ogoh/table'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_DIR
'/user/bob/table'

$insert overwrite local directory 'hdfs directory name' select * from
temptable; ;
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
/user/bob/table
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_LOCAL_DIR
/user/ogoh/table
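
(Until that syntax is supported, one workaround is to write the query result out from Spark directly; a sketch, with a hypothetical output path.)

// Run the query and save the rows as text files on HDFS.
val rows = sqlContext.sql("SELECT * FROM temptable")
rows.map(_.mkString("\t")).saveAsTextFile("hdfs:///user/ogoh/table_export")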

Thanks,
Okehee



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-supports-hive-insert-overwrite-directory-tp21951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Hi,
I am filtering the first DStream with the value in the second DStream. I also
want to keep the value of the second DStream. I have done the following and am
having a problem with returning the new RDD:
val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
RDD[(String,String)], rdd2 : RDD[Int]) => {
    var first = ""; var second = ""; var third = 0
    if (rdd2.first >= 3)
    {
        first = rdd1.map(_._1).first
        second = rdd1.map(_._2).first
        third = rdd2.first
    }
    RDD[(first,second,third)]
})

ERROR /home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119: error: not found: value RDD
[ERROR]  RDD[(first,second,third)]

I have imported org.apache.spark.rdd.RDD.
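
(One way to make this compile is to build the returned RDD explicitly instead of writing RDD[(first,second,third)]; a sketch, assuming the intent is to emit a single (String, String, Int) record when the anomaly value passes the threshold — the >= 3 condition is a guess at the original comparison.)

import org.apache.spark.rdd.RDD

val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    val anomalies = rdd2.take(1)
    if (anomalies.nonEmpty && anomalies(0) >= 3) {
      val (first, second) = rdd1.first()
      // One-record RDD carrying the fileAndTime values plus the anomaly value.
      rdd1.sparkContext.parallelize(Seq((first, second, anomalies(0))))
    } else {
      rdd1.sparkContext.emptyRDD[(String, String, Int)]
    }
  })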

Regards,
Laeeq


Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares,

If you dig into the descriptions for the two jobs, it will probably return
something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

The code for Spark from the git copy of master at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Basically, line 428 refers to
val initialCount = this.count()

And line 447 refers to
var samples = this.sample(withReplacement, fraction,
rand.nextInt()).collect()

Basically, the first job is getting the count so you can do the second job
which is to generate the samples.
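
(In sketch form, the two jobs correspond roughly to the following; this is a simplification, since the real takeSample also handles oversampling and retries.)

val rdd = sc.textFile("README.md")

// Job 1: a count() to know the dataset size and derive a sampling fraction.
val total = rdd.count()

// Job 2: sample() + collect(), then trim to the requested number of items.
val fraction = math.min(1.0, 3.0 * 2 / total) // oversample slightly
val sampled = rdd.sample(withReplacement = false, fraction, seed = 42L)
  .collect()
  .take(3)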

HTH!
Denny




On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica rvern...@gmail.com wrote:

 Hello,

 I am using takeSample from the Scala Spark 1.2.1 shell:

 scala sc.textFile(README.md).takeSample(false, 3)


 and I notice that two jobs are generated on the Spark Jobs page:

 Job Id Description
 1 takeSample at console:13
 0  takeSample at console:13


 Any ideas why the two jobs are needed?

 Thanks!
 Rares



Re: Visualize Spark Job

2015-03-06 Thread Phuoc Do
I have this PR submitted. You can merge it and try.

https://github.com/apache/spark/pull/2077

On Thu, Jan 15, 2015 at 12:50 AM, Kuromatsu, Nobuyuki 
n.kuroma...@jp.fujitsu.com wrote:

 Hi

 I want to visualize tasks and stages in order to analyze spark jobs.
 I know necessary metrics is written in spark.eventLog.dir.

 Does anyone know the tool like swimlanes in Tez?

 Regards,

 Nobuyuki Kuromatsu

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Phuoc Do
https://vida.io/dnprock


Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread Zhan Zhang
Do you mean "--hiveConf" (two dashes) instead of "-hiveconf" (one dash)?

Thanks.

Zhan Zhang

On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote:

 Hello, 
 
 I want to execute a hql script through `spark-sql` command, my script 
 contains: 
 
 ```
 ALTER TABLE xxx 
 DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
 ```
 
 when I execute 
 
 ```
 spark-sql -f script.hql -hiveconf CUR_DATE=20150119
 ```
 
 It throws an error like 
 ```
 cannot recognize input near '$' '{' 'hiveconf' in constant
 ```
 
 I have try on hive and it works. Thus how could I pass a parameter like date 
 to a hql script?
 
 Alcaid


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



takeSample triggers 2 jobs

2015-03-06 Thread Rares Vernica
Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala sc.textFile(README.md).takeSample(false, 3)


and I notice that two jobs are generated on the Spark Jobs page:

Job Id Description
1 takeSample at console:13
0  takeSample at console:13


Any ideas why the two jobs are needed?

Thanks!
Rares


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread sandeep vura
Hi,

For creating a Hive table, do I need to add hive-site.xml to the spark/conf
directory?

On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com
wrote:

 Its not required, but even if you don't have hive installed you probably
 still want to use the HiveContext.  From earlier in that doc:

 In addition to the basic SQLContext, you can also create a HiveContext,
 which provides a superset of the functionality provided by the basic
 SQLContext. Additional features include the ability to write queries using
 the more complete HiveQL parser, access to HiveUDFs, and the ability to
 read data from Hive tables. To use a HiveContext, *you do not need to
 have an existing Hive setup*, and all of the data sources available to a
 SQLContext are still available. HiveContext is only packaged separately to
 avoid including all of Hive’s dependencies in the default Spark build. If
 these dependencies are not a problem for your application then using
 HiveContext is recommended for the 1.2 release of Spark. Future releases
 will focus on bringing SQLContext up to feature parity with a HiveContext.


 On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

 Hi Edmon,

 No, you do not need to install Hive to use Spark SQL.

 Thanks,

 Yin

 On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon






Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Shivaram Venkataraman
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are
a good reference

Shivaram
On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Do you have a reference paper to the implemented algorithm in TSQR.scala ?

 On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are couple of solvers that I've written that is part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can
 use out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao






Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Hi Todd,

Looks like the thrift server can connect to the metastore, but something is
wrong in the executors. You can try to get the log with "yarn logs
-applicationId xxx" to check why it failed. If there is no log (the master or
executor was not started at all), you can go to the RM webpage and click the
link to see why the shell failed in the first place.

Thanks.

Zhan Zhang

On Mar 6, 2015, at 9:59 AM, Todd Nist tsind...@gmail.com wrote:

First, thanks to everyone for their assistance and recommendations.

@Marcelo

I applied the patch that you recommended and am now able to get into the shell, 
thank you worked great after I realized that the pom was pointing to the 
1.3.0-SNAPSHOT for parent, need to bump that down to 1.2.1.

@Zhan

Need to apply this patch next.  I tried to start the spark-thriftserver but and 
it starts, then fails with like this:  I have the entries in my 
spark-default.conf, but not the patch applied.


./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m --hiveconf 
hive.server2.thrift.port=10001

5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at 
http://hadoopdev01.opsdatastore.com:4040
15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: 
http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at 
hadoopdev02.opsdatastore.com/192.168.15.154:8050
15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster 
with 4 NodeManagers
15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (8192 MB per container)
15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB 
memory including 384 MB overhead
15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our 
AM
15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit 
local reads feature cannot be used because libhadoop cannot be loaded.
15/03/06 12:34:19 INFO yarn.Client: Uploading resource 
file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar 
-> 
hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our 
AM container
15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application 
application_1425078697953_0018
15/03/06 12:34:22 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:22 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: 
http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:23 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:24 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:25 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:26 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster 
registered as 
Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> 
hadoopdev02.opsdatastore.com, PROXY_URI_BASES -> 
http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018),
 /proxy/application_1425078697953_0018
15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/03/06 12:34:27 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: RUNNING)
15/03/06 12:34:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: hadoopdev08.opsdatastore.com
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: 

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote:

 Can i get document how to create that setup .i mean i need hive
 integration on spark


http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables


Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Hi Zhan,

I applied the patch you recommended,
https://github.com/apache/spark/pull/3409, and it now works. It was failing
with this:

Exception message:
/hadoop/yarn/local/usercache/root/appcache/application_1425078697953_0020/container_1425078697953_0020_01_02/launch_container.sh:
line 14:
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/
*${hdp.version}*/hadoop/lib/hadoop-lzo-0.6.0.*${hdp.version}*.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*:
*bad substitution*

While the spark-defaults.conf has these defined:

spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041


Without the patch, *${hdp.version}* was not being substituted.

Thanks for pointing me to that patch, appreciate it.

-Todd

On Fri, Mar 6, 2015 at 1:12 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Hi Todd,

  Looks like the thrift server can connect to the metastore, but something
 is wrong in the executors. You can try to get the log with "yarn logs
 -applicationId xxx" to check why it failed. If there is no log (the master or
 executor is not started at all), you can go to the RM webpage and click the
 link to see why the shell failed in the first place.

  Thanks.

  Zhan Zhang

  On Mar 6, 2015, at 9:59 AM, Todd Nist tsind...@gmail.com wrote:

  First, thanks to everyone for their assistance and recommendations.

  @Marcelo

  I applied the patch that you recommended and am now able to get into the
 shell. Thank you, it worked great once I realized that the pom was pointing to
 the 1.3.0-SNAPSHOT parent and needed to be bumped down to 1.2.1.

  @Zhan

  I need to apply this patch next. I tried to start the spark-thriftserver;
 it starts, then fails like this (I have the entries in my spark-defaults.conf,
 but not the patch applied):

   ./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m 
 --hiveconf hive.server2.thrift.port=10001

  15/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
 15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
 15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050
 15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
 15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
 15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
 15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM
 15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
 15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
 15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar -> hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
 15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container
 15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
 15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
 15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
 15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
 15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018
 15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
 15/03/06 12:34:22 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1425663261755
  final status: UNDEFINED
  tracking URL: 
 http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
  user: root
 15/03/06 12:34:23 INFO yarn.Client: Application report for 
 

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Sorry. Misunderstanding. Looks like it already worked. If you still hit some 
hdp.version problem, you can try it :)

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote:

You are using 1.2.1, right? If so, please add a java-opts file in the conf 
directory and give it a try.

[root@c6401 conf]# more java-opts
  -Dhdp.version=2.2.2.0-2041

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

 -Dhdp.version=2.2.0.0-2041




Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
You are using 1.2.1, right? If so, please add a java-opts file in the conf 
directory and give it a try.

[root@c6401 conf]# more java-opts
  -Dhdp.version=2.2.2.0-2041

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

 -Dhdp.version=2.2.0.0-2041



Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Working great now, after applying that patch; thanks again.

On Fri, Mar 6, 2015 at 2:42 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Sorry. Misunderstanding. Looks like it already worked. If you still hit
 some hdp.version problem, you can try it :)

  Thanks.

  Zhan Zhang

  On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote:

  You are using 1.2.1, right? If so, please add a java-opts file in the conf
 directory and give it a try.

  [root@c6401 conf]# more java-opts
   -Dhdp.version=2.2.2.0-2041

  Thanks.

  Zhan Zhang

  On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

  -Dhdp.version=2.2.0.0-2041






Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert,

Would you like to register this on spark-packages.org?

Burak

On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:

 currently spark provides many excellent algorithms for operations per key
 as long as the data send to the reducers per key fits in memory. operations
 like combineByKey, reduceByKey and foldByKey rely on pushing the operation
 map-side so that the data reduce-side is small. and groupByKey simply
 requires that the values per key fit in memory.

 but there are algorithms for which we would like to process all the values
 per key reduce-side, even when they do not fit in memory. examples are
 algorithms that need to process the values ordered, or algorithms that need
 to emit all values again. basically this is what the original hadoop reduce
 operation did so well: it allowed sorting of values (using secondary sort),
 and it processed all values per key in a streaming fashion.

 the library spark-sorted aims to bring these kind of operations back to
 spark, by providing a way to process values with a user provided
 Ordering[V] and a user provided streaming operation Iterator[V] =
 Iterator[W]. it does not make the assumption that the values need to fit in
 memory per key.

 the basic idea is to rely on spark's sort-based shuffle to re-arrange the
 data so that all values for a given key are placed consecutively within a
 single partition, and then process them using a map-like operation.

 you can find the project here:
 https://github.com/tresata/spark-sorted

 the project is in a very early stage. any feedback is very much
 appreciated.
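
 To make the idea concrete, here is a minimal, hypothetical sketch (not the
 spark-sorted API) of the same trick using Spark's built-in
 repartitionAndSortWithinPartitions: partition on the natural key, sort the
 shuffle on a (key, value) composite key, then stream over each partition with
 mapPartitions. The names NaturalKeyPartitioner and SecondarySortSketch are
 illustrative.

 import org.apache.spark.{Partitioner, SparkConf, SparkContext}
 import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.2

 // Route records by the natural key only, so every value of a key lands in the
 // same partition even though the shuffle sorts on the (key, value) composite.
 class NaturalKeyPartitioner(parts: Int) extends Partitioner {
   def numPartitions: Int = parts
   def getPartition(key: Any): Int = {
     val natural = key.asInstanceOf[(String, Int)]._1
     ((natural.hashCode % parts) + parts) % parts
   }
 }

 object SecondarySortSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("SecondarySortSketch"))
     val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2), ("b", 1)))

     // Promote the value into the key; the sort-based shuffle then delivers each
     // key's values consecutively and in ascending order within a partition.
     val sorted = pairs
       .map { case (k, v) => ((k, v), ()) }
       .repartitionAndSortWithinPartitions(new NaturalKeyPartitioner(2))

     // Stream over the partition: only running state, no per-key buffering.
     val ranked = sorted.mapPartitions { iter =>
       var currentKey: String = null
       var rank = 0
       iter.map { case ((k, v), _) =>
         if (k != currentKey) { currentKey = k; rank = 0 }
         rank += 1
         (k, v, rank)   // e.g. the rank of each value within its key
       }
     }

     ranked.collect().foreach(println)
     sc.stop()
   }
 }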






Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
Only if you want to configure the connection to an existing hive metastore.

On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote:

 Hi,

 For creating a Hive table, do I need to add hive-site.xml to the spark/conf
 directory?

 On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com
 wrote:

 It's not required, but even if you don't have Hive installed you probably
 still want to use the HiveContext. From earlier in that doc:

 In addition to the basic SQLContext, you can also create a HiveContext,
 which provides a superset of the functionality provided by the basic
 SQLContext. Additional features include the ability to write queries using
 the more complete HiveQL parser, access to HiveUDFs, and the ability to
 read data from Hive tables. To use a HiveContext, *you do not need to
 have an existing Hive setup*, and all of the data sources available to
 a SQLContext are still available. HiveContext is only packaged separately
 to avoid including all of Hive’s dependencies in the default Spark build.
 If these dependencies are not a problem for your application then using
 HiveContext is recommended for the 1.2 release of Spark. Future releases
 will focus on bringing SQLContext up to feature parity with a HiveContext.


 On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

 Hi Edmon,

 No, you do not need to install Hive to use Spark SQL.

 Thanks,

 Yin

 On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon







Re: Data Frame types

2015-03-06 Thread Michael Armbrust
No, the UDT API is not a public API, as we have not stabilized the
implementation. For this reason it's only accessible to projects inside of
Spark.

On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Hi Cesar,

 Yes, you can define a UDT with the new DataFrame, the same way that
 SchemaRDD did.

 Jaonary

 On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote:


 The SchemaRDD supports the storage of user defined classes. However, in
 order to do that, the user class needs to extend the UserDefinedType interface
 (see for example VectorUDT in org.apache.spark.mllib.linalg).

 My question is: Will the new DataFrame structure (to be released in Spark
 1.3) be able to handle user defined classes too? Will user classes need to
 extend the same interface and follow the same approach?


 --
 Cesar Flores





Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote:

 Yes, I want to link with an existing hive metastore. Is that the right way to
 link to the hive metastore?


Yes.


Re: Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Yes, this is the problem. I want to return an RDD, but it is abstract and I
cannot instantiate it. So what are the other options? I have two streams and I
want to filter one stream on the basis of the other, and also keep the value of
the other stream. I have also tried join, but one stream has more values than
the other in each sliding window, and after the join I get repetitions which I
don't want.

Regards,
Laeeq

 On Friday, March 6, 2015 8:11 PM, Sean Owen so...@cloudera.com wrote:
   

 What is this line supposed to mean?

RDD[(first,second,third)]

It's not valid as a line of code, and you don't instantiate RDDs anyway.

On Fri, Mar 6, 2015 at 7:06 PM, Laeeq Ahmed
laeeqsp...@yahoo.com.invalid wrote:
 Hi,

 I am filtering first DStream with the value in second DStream. I also want
 to keep the value of second Dstream. I have done the following and having
 problem with returning new RDD:

 val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
 RDD[(String,String)], rdd2 : RDD[Int]) => {
   var first = ""; var second = ""; var third = 0
   if (rdd2.first >= 3)
   {
     first = rdd1.map(_._1).first
     second = rdd1.map(_._2).first
     third = rdd2.first
   }
   RDD[(first,second,third)]
 })

 ERROR
 /home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119:
 error: not found: value RDD
 [ERROR] RDD[(first,second,third)]

 I have imported org.apache.spark.rdd.RDD.


 Regards,
 Laeeq
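
 One way to do what is being asked without constructing an RDD by hand is to
 build the result from rdd1 inside the transformWith function and return that
 RDD. A hedged sketch follows; the stream names mirror the thread, while
 tagWithAnomaly and the threshold 3 are illustrative.

 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.dstream.DStream

 // Keep the value of the second stream while tagging the first, and return a
 // real RDD derived from rdd1 instead of trying to instantiate one.
 def tagWithAnomaly(
     fileAndTime: DStream[(String, String)],
     anomaly: DStream[Int]): DStream[(String, String, Int)] = {
   fileAndTime.transformWith(anomaly, (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
     // take(1) avoids the exception that rdd2.first() throws on an empty batch
     val level = rdd2.take(1).headOption.getOrElse(0)
     if (level >= 3) {
       rdd1.map { case (file, time) => (file, time, level) }
     } else {
       rdd1.sparkContext.parallelize(Seq.empty[(String, String, Int)])
     }
   })
 }

 If the goal is really a join without repetitions, another option is to reduce
 the second stream to a single value per window (for example with a max) before
 joining.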



   

Re: Spark Streaming Switchover Time

2015-03-06 Thread Tathagata Das
It is probably the time taken by the system to figure out that the worker
is down. Could you see in the logs to find what goes on when you kill the
worker?

TD

On Wed, Mar 4, 2015 at 6:20 AM, Nastooh Avessta (navesta) nave...@cisco.com
 wrote:

  Indeed. And am wondering if this switchover time can be decreased.

 Cheers,






 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* Tuesday, March 03, 2015 11:11 PM

 *To:* Nastooh Avessta (navesta)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Spark Streaming Switchover Time



 I am confused. Are you killing the 1st worker node to see whether the
 system restarts the receiver on the second worker?



 TD



 On Tue, Mar 3, 2015 at 10:49 PM, Nastooh Avessta (navesta) 
 nave...@cisco.com wrote:

 This is the time it takes for the driver to start receiving data once again,
 from the 2nd worker, after the 1st worker, where the streaming thread was
 initially running, is shut down.

 Cheers,






 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* Tuesday, March 03, 2015 10:24 PM
 *To:* Nastooh Avessta (navesta)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Spark Streaming Switchover Time



 Can you elaborate on what is this switchover time?



 TD



 On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) 
 nave...@cisco.com wrote:

 Hi

 On a standalone Spark 1.0.0 cluster, with 1 master and 2 workers, operating in
 client mode and running a UDP streaming application, I am seeing around a
 2-second elapsed time on switchover when shutting down the streaming worker,
 where the streaming window length is 1 sec. I am wondering what parameters are
 available to the developer to shorten this switchover time.

 Cheers,




Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Night Wolf
Tried with that. No luck. Same error on abt-interface jar. I can see maven
downloaded that jar into my .m2 cache

On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:

 try it with mvn  -DskipTests -Pscala-2.11 clean install package


Store the shuffled files in memory using Tachyon

2015-03-06 Thread sara mustafa
Hi all,

Is it possible to store Spark shuffled files on a distributed memory like
Tachyon instead of spilling them to disk?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Store-the-shuffled-files-in-memory-using-Tachyon-tp21944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-stream programme failed on yarn-client

2015-03-06 Thread Akhil Das
Looks like an issue with your yarn setup, could you try doing a simple
example with spark-shell?

Start the spark shell as:

$ MASTER=yarn-client bin/spark-shell
spark-shell> sc.parallelize(1 to 1000).collect

If that doesn't work, then make sure your yarn services are up and running,
and in your spark-env.sh you may set the corresponding configurations from
the following:


# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512
Mb)


Thanks
Best Regards

On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong 980548...@qq.com wrote:

 Hi all,


 I'm trying to write a Spark Streaming programme, so I read the Spark
 online documentation; following the document, I wrote the programme like this:

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.StreamingContext._

 object SparkStreamTest {
   def main(args: Array[String]) {
     val conf = new SparkConf()
     val ssc = new StreamingContext(conf, Seconds(1))
     val lines = ssc.socketTextStream(args(0), args(1).toInt)
     val words = lines.flatMap(_.split(" "))
     val pairs = words.map(word => (word, 1))
     val wordCounts = pairs.reduceByKey(_ + _)
     wordCounts.print()
     ssc.start()             // Start the computation
     ssc.awaitTermination()  // Wait for the computation to terminate
   }
 }



 For testing, I first start listening on a port with:
  nc -lk 

 and then I submit the job with:
 spark-submit  --master local[2] --class com.nd.hxf.SparkStreamTest
 spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 

 everything is okay


 but when I run it on YARN with:
 spark-submit  --master yarn-client --class com.nd.hxf.SparkStreamTest
 spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 

 it waits for a long time and repeatedly outputs some messages; part of the
 output looks like this:







 15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after waiting
 maxRegisteredResourcesWaitingTime: 3(ms)
 15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
 15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO SocketInputDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated
 org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
 15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO FlatMappedDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated
 org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
 15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false,
 false, false, false, 1)
 15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO MappedDStream: Initialized and validated
 org.apache.spark.streaming.dstream.MappedDStream@941451f
 15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO ShuffledDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated
 org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee
 15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false,
 false, false, false, 1)
 15/03/06 15:30:24 INFO ForEachDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO ForEachDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO ForEachDStream: Initialized and validated
 org.apache.spark.streaming.dstream.ForEachDStream@48d166b5
 15/03/06 15:30:24 INFO SparkContext: Starting job: start at
 SparkStreamTest.scala:21
 15/03/06 15:30:24 INFO