HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the
code

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")

conf.set("spark.executor.extraClassPath",
"/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.driver.extraClassPath",
"/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.yarn.am.waitTime", "30L")

val sc = new SparkContext(conf)

val sqlContext = new HiveContext(sc)

def inputRDD = sqlContext.sql("describe spark_poc.src_digital_profile_user")

inputRDD.collect().foreach { println }

println(inputRDD.schema.getClass.getName)

Getting this exception. Any clues? The weird part is if I try to do the same
thing but in Java instead of Scala, it runs fine.

Exception in thread "Driver" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
15/03/06 17:39:32 ERROR yarn.ApplicationMaster: SparkContext did not
initialize after waiting for 10000 ms. Please check earlier log output for
errors. Failing the application.
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:218)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:434)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:433)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
15/03/06 17:39:32 INFO yarn.ApplicationMaster: AppMaster received a signal.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-test-Spark-Context-did-not-initialize-after-waiting-1ms-tp21953.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread Marcelo Vanzin
On Fri, Mar 6, 2015 at 2:47 PM, nitinkak001 nitinkak...@gmail.com wrote:
 I am trying to run a Hive query from Spark using HiveContext. Here is the
 code

 val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")

 conf.set("spark.executor.extraClassPath",
 "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
 conf.set("spark.driver.extraClassPath",
 "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
 conf.set("spark.yarn.am.waitTime", "30L")

You're missing /* at the end of your classpath entries. Also, since
you're on CDH 5.2, you'll probably need to filter out the guava jar
from Hive's lib directory, otherwise things might break. So things
will get a little more complicated.

With CDH 5.3 you shouldn't need to filter out the guava jar.
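
(For reference, a minimal sketch of the corrected settings described above, using the paths from the original post; on CDH 5.2 the guava jar under that directory may still need to be filtered out, as noted.)

// Sketch only: append /* so the jars inside the directory are picked up.
val hiveLib = "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/*"
val conf = new SparkConf()
  .setAppName("HiveSparkIntegrationTest")
  .set("spark.executor.extraClassPath", hiveLib)
  .set("spark.driver.extraClassPath", hiveLib)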

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
i added it

On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote:

 Hi Koert,

 Would you like to register this on spark-packages.org?

 Burak

 On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:

 currently spark provides many excellent algorithms for operations per key
 as long as the data sent to the reducers per key fits in memory. operations
 like combineByKey, reduceByKey and foldByKey rely on pushing the operation
 map-side so that the data reduce-side is small. and groupByKey simply
 requires that the values per key fit in memory.

 but there are algorithms for which we would like to process all the
 values per key reduce-side, even when they do not fit in memory. examples
 are algorithms that need to process the values ordered, or algorithms that
 need to emit all values again. basically this is what the original hadoop
 reduce operation did so well: it allowed sorting of values (using secondary
 sort), and it processed all values per key in a streaming fashion.

 the library spark-sorted aims to bring these kind of operations back to
 spark, by providing a way to process values with a user provided
 Ordering[V] and a user provided streaming operation Iterator[V] =>
 Iterator[W]. it does not make the assumption that the values need to fit in
 memory per key.

 the basic idea is to rely on spark's sort-based shuffle to re-arrange the
 data so that all values for a given key are placed consecutively within a
 single partition, and then process them using a map-like operation.

 you can find the project here:
 https://github.com/tresata/spark-sorted

 the project is in a very early stage. any feedback is very much
 appreciated.







Spark streaming and executor object reusage

2015-03-06 Thread Jean-Pascal Billaud
Hi,

Reading through the Spark Streaming Programming Guide, I read in the
Design Patterns for using foreachRDD:

Finally, this can be further optimized by reusing connection objects
across multiple RDDs/batches.
One can maintain a static pool of connection objects that can be reused as
RDDs of multiple batches are pushed to the external system

I have a connection pool that may be more or less heavy to
instantiate. I don't use it as part of a foreachRDD but as part of regular
map operations to query some API service. I'd like to understand what
"multiple batches" means here. Is this across RDDs on a single DStream?
Across multiple DStreams?

I'd also like to understand the context shareability across DStreams over
time. Is it expected that the executor initializing my factory will keep
getting batches from my streaming job while using the same singleton
connection pool over and over? Or does Spark reset executor state after each
DStream completes, potentially to allocate executors to other streaming
jobs?
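
(For what it's worth, here is a minimal sketch of the per-executor singleton pattern the guide describes; ApiClientPool and Client are hypothetical names, and `lines` stands for the input DStream.)

// A lazily initialized singleton: created once per executor JVM and then
// reused by every task and batch scheduled on that executor.
object ApiClientPool {
  // "Client" is a hypothetical stand-in for a real connection pool / HTTP client.
  class Client(endpoint: String) { def query(record: String): String = record }
  lazy val client = new Client("http://api.example.com") // hypothetical endpoint
}

// Used from a regular map over the DStream: the client is not re-created per
// batch, only once per executor process.
val enriched = lines.map(record => ApiClientPool.client.query(record))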

Thanks,


Re: Spark code development practice

2015-03-06 Thread Sean Owen
Hm, why do you expect a factory method over a constructor? No, you
instantiate a SparkContext yourself (if not working in the shell).

When you write your own program, you parse your own command line args.
--master yarn-client doesn't do anything unless you make it do so;
that is an arg to *Spark* programs.

In many cases these args just end up setting a value on SparkConf, though
not always; you would need to trace through programs like SparkSubmit to
see how to emulate the effect of some args.

Unless you really need to do this, you should not. Write an app that
you run with spark-submit.
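
(To illustrate the point about SparkConf, a minimal sketch of setting the master in code rather than on the command line; the app name and master value are placeholders.)

// Roughly what --master ends up doing: setting a value on SparkConf.
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("yarn-client") // or "local[2]" while developing locally
val sc = new SparkContext(conf)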

On Fri, Mar 6, 2015 at 2:56 AM, Xi Shen davidshe...@gmail.com wrote:
 Thanks guys, this is very useful :)

 @Stephen, I know spark-shell will create a SC for me. But I don't understand
 why we still need to do new SparkContext(...) in our code. Shouldn't we get
 it from somewhere, e.g. SparkContext.get?

 Another question: if I want my spark code to run in YARN later, how should I
 create the SparkContext? Or can I just specify --master yarn on the command
 line?


 Thanks,
 David


 On Fri, Mar 6, 2015 at 12:38 PM Koen Vantomme koen.vanto...@gmail.com
 wrote:

 use the spark-shell command and the shell will open
 type :paste and then paste your code, then press Ctrl-D

 to open spark-shell:
 spark/bin
 ./spark-shell

 Verstuurd vanaf mijn iPhone

 Op 6-mrt.-2015 om 02:28 heeft fightf...@163.com fightf...@163.com het
 volgende geschreven:

 Hi,

 You can first set up a Scala IDE to develop and debug your Spark
 program, say, IntelliJ IDEA or Eclipse.

 Thanks,
 Sun.

 
 fightf...@163.com


 From: Xi Shen
 Date: 2015-03-06 09:19
 To: user@spark.apache.org
 Subject: Spark code development practice
 Hi,

 I am new to Spark. I see every spark program has a main() function. I
 wonder if I can run the spark program directly, without using spark-submit.
 I think it will be easier for early development and debugging.


 Thanks,
 David



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread fenghaixiong
you can read this document:
http://spark.apache.org/docs/latest/building-spark.html
It might solve your question.
Also, if you compile spark with maven you might need to set the maven options like this
before you start compiling:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

thanks
On Fri, Mar 06, 2015 at 07:10:34PM +1100, Night Wolf wrote:
 Tried with that. No luck. Same error on the sbt-interface jar. I can see maven
 downloaded that jar into my .m2 cache
 
 On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:
 
  try it with mvn  -DskipTests -Pscala-2.11 clean install package

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Sean Owen
Are you letting Spark download and run zinc for you? Maybe that copy
is incomplete or corrupted. You can try removing the downloaded zinc
from build/ and try again.

Or run your own zinc.

On Fri, Mar 6, 2015 at 7:51 AM, Night Wolf nightwolf...@gmail.com wrote:
 Hey,

 Trying to build latest spark 1.3 with Maven using

 -DskipTests clean install package

 But I'm getting errors with zinc, in the logs I see;

 [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-network-common_2.11 ---


 ...

 [error] Required file not found: sbt-interface.jar
 [error] See zinc -help for information about locating necessary files


 Any ideas?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Integer column in schema RDD from parquet being considered as string

2015-03-06 Thread gtinside
Hi tsingfu ,

Thanks for your reply. I tried with other columns but the problem is the same
with other Integer columns.

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Integer-column-in-schema-RDD-from-parquet-being-considered-as-string-tp21917p21950.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Found this thread:
http://search-hadoop.com/m/JW1q5HMrge2

Cheers

On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

 This was discussed in the past and viewed as dangerous to enable. The
  biggest problem, by far, comes when you have a job that outputs M
  partitions, 'overwriting' a directory of data containing N > M old
  partitions. You suddenly have a mix of new and old data.

 It doesn't match Hadoop's semantics either, which won't let you do
 this. You can of course simply remove the output directory.

 On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:
  Adding support for overwrite flag would make saveAsXXFile more user
 friendly.
 
  Cheers
 
 
 
  On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
 
  Hi folks,
 
  I found that RDD:saveXXFile has no overwrite flag which I think is very
 helpful. Is there any reason for this ?
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Nan Zhu
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable
output validation for the whole program,

the Spark implementation uses a DynamicVariable (in object PairRDDFunctions)
internally to disable it in a case-by-case manner:

val disableOutputSpecValidation: DynamicVariable[Boolean] = new 
DynamicVariable[Boolean](false)

I’m not sure there is enough benefit to make it worth exposing
this variable to the user…

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

 Found this thread:
 http://search-hadoop.com/m/JW1q5HMrge2
  
 Cheers
  
 On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  This was discussed in the past and viewed as dangerous to enable. The
  biggest problem, by far, comes when you have a job that outputs M
  partitions, 'overwriting' a directory of data containing N > M old
  partitions. You suddenly have a mix of new and old data.
   
  It doesn't match Hadoop's semantics either, which won't let you do
  this. You can of course simply remove the output directory.
   
  On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com 
  (mailto:yuzhih...@gmail.com) wrote:
   Adding support for overwrite flag would make saveAsXXFile more user 
   friendly.
  
   Cheers
  
  
  
   On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com 
   (mailto:zjf...@gmail.com) wrote:
  
   Hi folks,
  
   I found that RDD:saveXXFile has no overwrite flag which I think is very 
   helpful. Is there any reason for this ?
  
  
  
   --
   Best Regards
  
   Jeff Zhang
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)
  
  



Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-06 Thread Zsolt Tóth
Hi,

I submit spark jobs in yarn-cluster mode remotely from java code by calling
Client.submitApplication(). For some reason I want to use 1.3.0 jars on the
client side (e.g spark-yarn_2.10-1.3.0.jar) but I have
spark-assembly-1.2.1* on the cluster.
The problem is that the ApplicationMaster can't find the user application
jar (specified with --jar option). I think this is because of changes in
the classpath population in the Client class.
Is it possible to make this setup work without changing the codebase or the
jars?

Cheers,
Zsolt


Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Edmon Begoli
 Does Spark-SQL require installation of Hive for it to run correctly or not?

I could not tell from this statement:
https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

Thank you,
Edmon


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Sean Owen
This was discussed in the past and viewed as dangerous to enable. The
biggest problem, by far, comes when you have a job that outputs M
partitions, 'overwriting' a directory of data containing N > M old
partitions. You suddenly have a mix of new and old data.

It doesn't match Hadoop's semantics either, which won't let you do
this. You can of course simply remove the output directory.
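
(A minimal sketch of the two options mentioned in this thread; the output path is hypothetical.)

import org.apache.hadoop.fs.Path

// Option 1: remove the existing output directory yourself before saving.
val out = new Path("hdfs:///tmp/my-output") // hypothetical path
out.getFileSystem(sc.hadoopConfiguration).delete(out, true) // recursive delete
rdd.saveAsTextFile(out.toString)

// Option 2: turn off the "output directory already exists" check globally,
// keeping in mind the old/new partition mixing caveat described above:
// sparkConf.set("spark.hadoop.validateOutputSpecs", "false")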

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:
 Adding support for overwrite flag would make saveAsXXFile more user friendly.

 Cheers



 On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

 Hi folks,

 I found that RDD:saveXXFile has no overwrite flag which I think is very 
 helpful. Is there any reason for this ?



 --
 Best Regards

 Jeff Zhang

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Adding support for overwrite flag would make saveAsXXFile more user friendly. 

Cheers



 On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
 
 Hi folks,
 
 I found that RDD:saveXXFile has no overwrite flag which I think is very 
 helpful. Is there any reason for this ?
 
 
 
 -- 
 Best Regards
 
 Jeff Zhang

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Data Frame types

2015-03-06 Thread Cesar Flores
The SchemaRDD supports the storage of user defined classes. However, in
order to do that, the user class needs to extend the UserDefinedType interface
(see for example VectorUDT in org.apache.spark.mllib.linalg).

My question is: will the new DataFrame structure (to be released in Spark
1.3) be able to handle user defined classes too? Will user classes still
need to extend UserDefinedType, i.e. follow the same approach?


-- 
Cesar Flores


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon,

No, you do not need to install Hive to use Spark SQL.

Thanks,

Yin

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon



[SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread James
Hello,

I want to execute a hql script through `spark-sql` command, my script
contains:

```
ALTER TABLE xxx
DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
```

when I execute

```
spark-sql -f script.hql -hiveconf CUR_DATE=20150119
```

It throws an error like
```
cannot recognize input near '$' '{' 'hiveconf' in constant
```

I have tried it on hive and it works. So how could I pass a parameter like a
date to a hql script?

Alcaid


Re: LBGFS optimizer performace

2015-03-06 Thread Gustavo Enrique Salazar Torres
Hi there:

Yeah, I came to that same conclusion after tuning spark sql shuffle
parameter. Also cut out some classes I was using to parse my dataset and
finally created schema only with the fields needed for my model (before
that I was creating it with 63 fields while I just needed 15).
So I came with this set of parameters:

--num-executors 200
--executor-memory 800M
--conf spark.executor.extraJavaOptions=-XX:+UseCompressedOops
-XX:+AggressiveOpts
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.storage.memoryFraction=0.3
--conf spark.rdd.compress=true
--conf spark.sql.shuffle.partitions=4000
--driver-memory 4G

Now I processed 270 GB in 35 minutes and no OOM errors.
I have one question though: Does Spark SQL handle skewed tables? I was
wondering about that since my data has that feature and maybe there is more
room for performance improvement.

Thanks again.

Gustavo


On Thu, Mar 5, 2015 at 6:45 PM, DB Tsai dbt...@dbtsai.com wrote:

 PS, I would recommend compressing the data when you cache the RDD.
 There will be some overhead in compression/decompression and
 serialization/deserialization, but it helps a lot for iterative
 algorithms by making it possible to cache more data.
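
(A minimal sketch of that suggestion; `trainingData` stands for whatever RDD feeds LBFGS.)

import org.apache.spark.storage.StorageLevel

// Before creating the SparkContext: compress serialized RDD blocks.
// sparkConf.set("spark.rdd.compress", "true")

// Cache the training data in serialized (and hence compressed) form,
// trading some CPU for a much smaller memory footprint.
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)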

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Tue, Mar 3, 2015 at 2:27 PM, Gustavo Enrique Salazar Torres
 gsala...@ime.usp.br wrote:
  Yeah, I can call count before that and it works. Also I was over caching
  tables but I removed those. Now there is no caching but it gets really
 slow
  since it calculates my table RDD many times.
  Also hacked the LBFGS code to pass the number of examples which I
 calculated
  outside in a Spark SQL query but just moved the location of the problem.
 
  The query I'm running looks like this:
 
  sSELECT $mappedFields, tableB.id as b_id FROM tableA LEFT JOIN tableB
 ON
  tableA.id=tableB.table_a_id WHERE tableA.status='' OR tableB.status='' 
 
  mappedFields contains a list of fields which I'm interested in. The
 result
  of that query goes through (including sampling) some transformations
 before
  being input to LBFGS.
 
  My dataset has 180GB just for feature selection, I'm planning to use
 450GB
  to train the final model and I'm using 16 c3.2xlarge EC2 instances, that
  means I have 240GB of RAM available.
 
  Any suggestion? I'm starting to check the algorithm because I don't
  understand why it needs to count the dataset.
 
  Thanks
 
  Gustavo
 
  On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley jos...@databricks.com
  wrote:
 
  Is that error actually occurring in LBFGS?  It looks like it might be
  happening before the data even gets to LBFGS.  (Perhaps the outer join
  you're trying to do is making the dataset size explode a bit.)  Are you
 able
  to call count() (or any RDD action) on the data before you pass it to
 LBFGS?
 
  On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres
  gsala...@ime.usp.br wrote:
 
  Just did with the same error.
  I think the problem is the data.count() call in LBFGS because for
 huge
  datasets that's naive to do.
  I was thinking to write my version of LBFGS but instead of doing
  data.count() I will pass that parameter which I will calculate from a
 Spark
  SQL query.
 
  I will let you know.
 
  Thanks
 
 
  On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:
 
  Can you try increasing your driver memory, reducing the executors and
  increasing the executor memory?
 
  Thanks
  Best Regards
 
  On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres
  gsala...@ime.usp.br wrote:
 
  Hi there:
 
  I'm using LBFGS optimizer to train a logistic regression model. The
  code I implemented follows the pattern showed in
  https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but
 training
  data is obtained from a Spark SQL RDD.
  The problem I'm having is that LBFGS tries to count the elements in
 my
  RDD and that results in a OOM exception since my dataset is huge.
  I'm running on a AWS EMR cluster with 16 c3.2xlarge instances on
 Hadoop
  YARN. My dataset is about 150 GB but I sample (I take only 1% of the
 data)
  it in order to scale logistic regression.
  The exception I'm getting is this:
 
  15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in
  stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal):
  java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:2694)
  at java.lang.String.init(String.java:203)
  at
  com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
  at
 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
  at
 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
  at
  com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
  at
 
 

Re: Optimizing SQL Query

2015-03-06 Thread daniel queiroz
Dude,

please, attach the execution plan of the query and details about the
indexes.
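
(To get that plan out of Spark SQL itself, something like the following sketch usually works; the SELECT is abbreviated here.)

// Print Spark SQL's analyzed and physical plans for the query.
val result = sqlContext.sql("SELECT ...") // the full query from the post below
println(result.queryExecution)

// With a HiveContext, an EXPLAIN statement can also be run directly:
sqlContext.sql("EXPLAIN EXTENDED SELECT ...").collect().foreach(println)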



2015-03-06 9:07 GMT-03:00 anu anamika.guo...@gmail.com:

 I have a query that's like:

 Could you help in providing me pointers as to how to start to optimize it
 w.r.t. spark sql:


 sqlContext.sql(

 SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE

 FROM (
SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS
 SDP_USAGE

FROM (
  SELECT *

  FROM date_d dd JOIN interval_f intf

  ON intf.DATE_WID = dd.WID

  WHERE intf.DATE_WID >= 20141116 AND
 intf.DATE_WID <= 20141125 AND CAST(INTERVAL_END_TIME AS STRING) >=
 '2014-11-16 00:00:00.000' AND CAST(INTERVAL_END_TIME
AS STRING) <= '2014-11-26
 00:00:00.000' AND MEAS_WID = 3

   ) test JOIN sdp_d sdp

ON test.SDP_WID = sdp.WID

WHERE sdp.UDC_ID = 'SP-1931201848'

GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID

) dw

 GROUP BY dw.DAY_OF_WEEK, dw.HOUR)



 Currently the query takes 15 minutes execution time where interval_f table
 holds approx 170GB of raw data, date_d -- 170 MB and sdp_d -- 490MB



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Do you have a reference paper to the implemented algorithm in TSQR.scala ?

On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 There are a couple of solvers that I've written that are part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLlib yet though, and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can use
 out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao





spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
currently spark provides many excellent algorithms for operations per key
as long as the data sent to the reducers per key fits in memory. operations
like combineByKey, reduceByKey and foldByKey rely on pushing the operation
map-side so that the data reduce-side is small. and groupByKey simply
requires that the values per key fit in memory.

but there are algorithms for which we would like to process all the values
per key reduce-side, even when they do not fit in memory. examples are
algorithms that need to process the values ordered, or algorithms that need
to emit all values again. basically this is what the original hadoop reduce
operation did so well: it allowed sorting of values (using secondary sort),
and it processed all values per key in a streaming fashion.

the library spark-sorted aims to bring these kind of operations back to
spark, by providing a way to process values with a user provided
Ordering[V] and a user provided streaming operation Iterator[V] =>
Iterator[W]. it does not make the assumption that the values need to fit in
memory per key.

the basic idea is to rely on spark's sort-based shuffle to re-arrange the
data so that all values for a given key are placed consecutively within a
single partition, and then process them using a map-like operation.
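
(For readers who want the flavor of this without the library, here is a minimal sketch of the same idea using only core Spark APIs, assuming `data` is an RDD[(String, Int)] of (key, value) pairs; it is not the spark-sorted API itself.)

import org.apache.spark.SparkContext._
import org.apache.spark.{HashPartitioner, Partitioner}

// Move the value into the key so the shuffle sorts by (key, value)...
val keyed = data.map { case (k, v) => ((k, v), ()) }

// ...but partition by the original key only, so every value of a key
// lands in the same partition.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) => delegate.getPartition(k)
  }
}

val processed = keyed
  .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(8))
  .mapPartitions { it =>
    // Values for each key now arrive consecutively and sorted, so they can
    // be processed in a single streaming pass without buffering per key.
    it.map { case ((k, v), _) => (k, v) }
  }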

you can find the project here:
https://github.com/tresata/spark-sorted

the project is in a very early stage. any feedback is very much appreciated.


Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Since we already have spark.hadoop.validateOutputSpecs config, I think
there is not much need to expose disableOutputSpecValidation

Cheers

On Fri, Mar 6, 2015 at 7:34 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

  Actually, except setting spark.hadoop.validateOutputSpecs to false to
 disable output validation for the whole program

 Spark implementation uses a Dynamic Variable (object PairRDDFunctions)
 internally to disable it in a case-by-case manner

 val disableOutputSpecValidation: DynamicVariable[Boolean] = new 
 DynamicVariable[Boolean](false)


 I’m not sure if there is enough amount of benefits to make it worth exposing 
 this variable to the user…


 Best,


 --
 Nan Zhu
 http://codingcat.me

 On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

 Found this thread:
 http://search-hadoop.com/m/JW1q5HMrge2

 Cheers

 On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

 This was discussed in the past and viewed as dangerous to enable. The
 biggest problem, by far, comes when you have a job that outputs M
 partitions, 'overwriting' a directory of data containing N > M old
 partitions. You suddenly have a mix of new and old data.

 It doesn't match Hadoop's semantics either, which won't let you do
 this. You can of course simply remove the output directory.

 On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:
  Adding support for overwrite flag would make saveAsXXFile more user
 friendly.
 
  Cheers
 
 
 
  On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:
 
  Hi folks,
 
  I found that RDD:saveXXFile has no overwrite flag which I think is very
 helpful. Is there any reason for this ?
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 






Re: Data Frame types

2015-03-06 Thread Jaonary Rabarisoa
Hi Cesar,

Yes, you can define a UDT with the new DataFrame, the same way that you
did with SchemaRDD.

Jaonary

On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote:


 The SchemaRDD supports the storage of user defined classes. However, in
 order to do that, the user class needs to extend the UserDefinedType
 interface (see for example VectorUDT in org.apache.spark.mllib.linalg).

 My question is: will the new DataFrame structure (to be released in Spark
 1.3) be able to handle user defined classes too? Will user classes still
 need to extend UserDefinedType, i.e. follow the same approach?


 --
 Cesar Flores



Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Hi Shivaram,

Thank you for the link. I'm trying to figure out how I can port this to
mllib. Maybe you can help me understand how the pieces fit together.
Currently, in mllib there are different types of distributed matrix:
BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one
should correspond to RowPartitionedMatrix in ml-matrix?



On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 There are couple of solvers that I've written that is part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can use
 out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao





Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-06 Thread Sean Owen
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile.

Why are you running install package and not just install? Probably
doesn't matter.

This sounds like you are trying to only build core without building
everything else, which you can't do in general unless you already
built and installed these snapshot artifacts locally.

On Fri, Mar 6, 2015 at 12:46 AM, Night Wolf nightwolf...@gmail.com wrote:
 Hey guys,

 Trying to build Spark 1.3 for Scala 2.11.

 I'm running with the folllowng Maven command;

 -DskipTests -Dscala-2.11 clean install package


 Exception:

 [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve
 dependencies for project
 org.apache.spark:spark-core_2.10:jar:1.3.0-SNAPSHOT: The following artifacts
 could not be resolved:
 org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT,
 org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT: Failure to
 find org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT in
 http://repository.apache.org/snapshots was cached in the local repository,
 resolution will not be reattempted until the update interval of
 apache.snapshots has elapsed or updates are forced - [Help 1]


 I see these warnings in the log before this error:


 [INFO]
 [INFO]
 
 [INFO] Building Spark Project Core 1.3.0-SNAPSHOT
 [INFO]
 
 [WARNING] The POM for
 org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT is missing, no
 dependency information available
 [WARNING] The POM for
 org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT is missing,
 no dependency information available


 Any ideas?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Optimizing SQL Query

2015-03-06 Thread anu
I have a query that's like:

Could you help in providing me pointers as to how to start to optimize it
w.r.t. spark sql:


sqlContext.sql(

SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE

FROM (
   SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS
SDP_USAGE

   FROM (
 SELECT *

 FROM date_d dd JOIN interval_f intf

 ON intf.DATE_WID = dd.WID

  WHERE intf.DATE_WID >= 20141116 AND
intf.DATE_WID <= 20141125 AND CAST(INTERVAL_END_TIME AS STRING) >=
'2014-11-16 00:00:00.000' AND CAST(INTERVAL_END_TIME
   AS STRING) <= '2014-11-26
00:00:00.000' AND MEAS_WID = 3

  ) test JOIN sdp_d sdp

   ON test.SDP_WID = sdp.WID

   WHERE sdp.UDC_ID = 'SP-1931201848'

   GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID

   ) dw

GROUP BY dw.DAY_OF_WEEK, dw.HOUR)



Currently the query takes 15 minutes execution time where interval_f table
holds approx 170GB of raw data, date_d -- 170 MB and sdp_d -- 490MB 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-stream programme failed on yarn-client

2015-03-06 Thread fenghaixiong
Thanks, your advice is useful. I had submitted my job from my spark client,
which is configured with a simple configuration file, so it failed.
When I run my job on the service machine everything is okay.

 
On Fri, Mar 06, 2015 at 02:10:04PM +0530, Akhil Das wrote:
 Looks like an issue with your yarn setup, could you try doing a simple
 example with spark-shell?
 
 Start the spark shell as:
 
 $ MASTER=yarn-client bin/spark-shell
 spark-shell> sc.parallelize(1 to 1000).collect
 
 ​If that doesn't work, then make sure your yarn services are up and running
 and in your spark-env.sh you may set the corresponding configurations from
 the following:​
 
 
 # Options read in YARN client mode
 # - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
 # - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
 # - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
 # - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
 # - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512
 Mb)
 
 
 Thanks
 Best Regards
 
 On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong 980548...@qq.com wrote:
 
  Hi all,
 
 
  I'm trying to write a spark stream programme, so I read the spark
  online documentation. According to the document I wrote the program like this:
 
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming._
  import org.apache.spark.streaming.StreamingContext._
 
  object SparkStreamTest {
def main(args: Array[String]) {
  val conf = new SparkConf()
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream(args(0), args(1).toInt)
   val words = lines.flatMap(_.split(" "))
   val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)
  wordCounts.print()
  ssc.start() // Start the computation
  ssc.awaitTermination()  // Wait for the computation to terminate
}
 
  }
 
 
 
  For testing, I first start listening on a port with:
   nc -lk 

  and then I submit the job with:
  spark-submit --master local[2] --class com.nd.hxf.SparkStreamTest
  spark-test-tream-1.0-SNAPSHOT-job.jar localhost 
 
  everything is okay
 
 
  but when i run it on yarn by this :
  spark-submit  --master yarn-client --class com.nd.hxf.SparkStreamTest
  spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 
 
  it waits for a long time and repeatedly outputs some messages; a part of the
  output looks like this:
 
 
 
 
 
 
 
  15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is
  ready for scheduling beginning after waiting
  maxRegisteredResourcesWaitingTime: 3(ms)
  15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
  15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
  15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO SocketInputDStream: Storage level =
  StorageLevel(false, false, false, false, 1)
  15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated
  org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
  15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO FlatMappedDStream: Storage level =
  StorageLevel(false, false, false, false, 1)
  15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated
  org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
  15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false,
  false, false, false, 1)
  15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO MappedDStream: Initialized and validated
  org.apache.spark.streaming.dstream.MappedDStream@941451f
  15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO ShuffledDStream: Storage level =
  StorageLevel(false, false, false, false, 1)
  15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
  15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
  15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated
  org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee
  15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms
  15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false,
  false, false, false, 1)
  15/03/06 

No overwrite flag for saveAsXXFile

2015-03-06 Thread Jeff Zhang
Hi folks,

I found that RDD:saveXXFile has no overwrite flag, which I think would be very
helpful. Is there any reason for this?



-- 
Best Regards

Jeff Zhang


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
It's not required, but even if you don't have Hive installed you probably
still want to use the HiveContext.  From earlier in that doc:

In addition to the basic SQLContext, you can also create a HiveContext,
 which provides a superset of the functionality provided by the basic
 SQLContext. Additional features include the ability to write queries using
 the more complete HiveQL parser, access to HiveUDFs, and the ability to
 read data from Hive tables. To use a HiveContext, *you do not need to
 have an existing Hive setup*, and all of the data sources available to a
 SQLContext are still available. HiveContext is only packaged separately to
 avoid including all of Hive’s dependencies in the default Spark build. If
 these dependencies are not a problem for your application then using
 HiveContext is recommended for the 1.2 release of Spark. Future releases
 will focus on bringing SQLContext up to feature parity with a HiveContext.
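
(A minimal sketch of that in practice, with no Hive installation present; the table is just an example.)

import org.apache.spark.sql.hive.HiveContext

// No existing Hive setup is required; a local metastore is created
// automatically in the working directory.
val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT COUNT(*) FROM src").collect().foreach(println)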


On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

 Hi Edmon,

 No, you do not need to install Hive to use Spark SQL.

 Thanks,

 Yin

 On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon





Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
First, thanks to everyone for their assistance and recommendations.

@Marcelo

I applied the patch that you recommended and am now able to get into the
shell. Thank you, it worked great once I realized that the pom was pointing to
the 1.3.0-SNAPSHOT for the parent and needed to be bumped down to 1.2.1.

@Zhan

Need to apply this patch next. I tried to start the spark-thriftserver; it
starts, then fails like this (I have the entries in my spark-defaults.conf,
but not the patch applied):

./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m
--hiveconf hive.server2.thrift.port=10001

5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050
15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM
15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar -> hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container
15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018
15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:22 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:23 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:24 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:25 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:26 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hadoopdev02.opsdatastore.com, PROXY_URI_BASES -> http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018), /proxy/application_1425078697953_0018
15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/03/06 12:34:27 INFO yarn.Client: Application report for application_1425078697953_0018 (state: RUNNING)
15/03/06 12:34:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: hadoopdev08.opsdatastore.com
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Application application_1425078697953_0018 has started running.
15/03/06 12:34:28 INFO netty.NettyBlockTransferService: Server created on 46124
15/03/06 12:34:28 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/03/06 12:34:28 INFO storage.BlockManagerMasterActor: Registering block manager hadoopdev01.opsdatastore.com:46124 with 265.4 MB RAM, BlockManagerId(driver, hadoopdev01.opsdatastore.com, 46124)
15/03/06 12:34:28 INFO storage.BlockManagerMaster: Registered

SparkSQL supports hive insert overwrite directory?

2015-03-06 Thread ogoh
Hello,
I am using Spark 1.2.1 along with Hive 0.13.1.
I run some hive queries using beeline and the Thriftserver.
Queries I tested so far worked well except the following:
I want to export the query output into a file on either HDFS or the local fs
(ideally the local fs).
Is that not yet supported?
The spark github has already unit tests using insert overwrite directory
in
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/insert_overwrite_local_directory_1.q.

$insert overwrite directory 'hdfs directory name' select * from temptable; 
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_DIR
'/user/ogoh/table'
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_DIR
'/user/bob/table'

$insert overwrite local directory 'hdfs directory name' select * from
temptable; ;
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
temptable
  TOK_INSERT
TOK_DESTINATION
  TOK_LOCAL_DIR
/user/bob/table
TOK_SELECT
  TOK_SELEXPR
TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for:
 TOK_DESTINATION
  TOK_LOCAL_DIR
/user/ogoh/table
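
(Until that syntax is supported, one workaround is to write the query result out from Spark directly; a sketch, with a hypothetical output path.)

// Run the query and save the rows as text files on HDFS.
val rows = sqlContext.sql("SELECT * FROM temptable")
rows.map(_.mkString("\t")).saveAsTextFile("hdfs:///user/ogoh/table_export")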

Thanks,
Okehee



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-supports-hive-insert-overwrite-directory-tp21951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Hi,
I am filtering the first DStream with the value in the second DStream. I also
want to keep the value of the second DStream. I have done the following and am
having a problem with returning the new RDD:
val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
RDD[(String,String)], rdd2 : RDD[Int]) => {
    var first = ""; var second = ""; var third = 0
    if (rdd2.first >= 3)
    {
        first = rdd1.map(_._1).first
        second = rdd1.map(_._2).first
        third = rdd2.first
    }
    RDD[(first,second,third)]
})

ERROR /home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119: error: not found: value RDD
[ERROR]  RDD[(first,second,third)]

I have imported org.apache.spark.rdd.RDD.
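
(One way to make this compile is to build the returned RDD explicitly instead of writing RDD[(first,second,third)]; a sketch, assuming the intent is to emit a single (String, String, Int) record when the anomaly value passes the threshold — the >= 3 condition is a guess at the original comparison.)

import org.apache.spark.rdd.RDD

val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    val anomalies = rdd2.take(1)
    if (anomalies.nonEmpty && anomalies(0) >= 3) {
      val (first, second) = rdd1.first()
      // One-record RDD carrying the fileAndTime values plus the anomaly value.
      rdd1.sparkContext.parallelize(Seq((first, second, anomalies(0))))
    } else {
      rdd1.sparkContext.emptyRDD[(String, String, Int)]
    }
  })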

Regards,
Laeeq


Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares,

If you dig into the descriptions for the two jobs, it will probably return
something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

The code for Spark from the git copy of master at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Basically, line 428 refers to
val initialCount = this.count()

And line 447 refers to
var samples = this.sample(withReplacement, fraction,
rand.nextInt()).collect()

Basically, the first job is getting the count so you can do the second job
which is to generate the samples.
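
(In sketch form, the two jobs correspond roughly to the following; this is a simplification, since the real takeSample also handles oversampling and retries.)

val rdd = sc.textFile("README.md")

// Job 1: a count() to know the dataset size and derive a sampling fraction.
val total = rdd.count()

// Job 2: sample() + collect(), then trim to the requested number of items.
val fraction = math.min(1.0, 3.0 * 2 / total) // oversample slightly
val sampled = rdd.sample(withReplacement = false, fraction, seed = 42L)
  .collect()
  .take(3)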

HTH!
Denny




On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica rvern...@gmail.com wrote:

 Hello,

 I am using takeSample from the Scala Spark 1.2.1 shell:

 scala sc.textFile(README.md).takeSample(false, 3)


 and I notice that two jobs are generated on the Spark Jobs page:

 Job Id Description
 1 takeSample at console:13
 0  takeSample at console:13


 Any ideas why the two jobs are needed?

 Thanks!
 Rares



Re: Visualize Spark Job

2015-03-06 Thread Phuoc Do
I have this PR submitted. You can merge it and try.

https://github.com/apache/spark/pull/2077

On Thu, Jan 15, 2015 at 12:50 AM, Kuromatsu, Nobuyuki 
n.kuroma...@jp.fujitsu.com wrote:

 Hi

 I want to visualize tasks and stages in order to analyze spark jobs.
 I know necessary metrics is written in spark.eventLog.dir.

 Does anyone know the tool like swimlanes in Tez?

 Regards,

 Nobuyuki Kuromatsu

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Phuoc Do
https://vida.io/dnprock


Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread Zhan Zhang
Do you mean "--hiveConf" (two dashes) instead of "-hiveconf" (one dash)?

Thanks.

Zhan Zhang

On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote:

 Hello, 
 
 I want to execute a hql script through `spark-sql` command, my script 
 contains: 
 
 ```
 ALTER TABLE xxx 
 DROP PARTITION (date_key = ${hiveconf:CUR_DATE});
 ```
 
 when I execute 
 
 ```
 spark-sql -f script.hql -hiveconf CUR_DATE=20150119
 ```
 
 It throws an error like 
 ```
 cannot recognize input near '$' '{' 'hiveconf' in constant
 ```
 
 I have try on hive and it works. Thus how could I pass a parameter like date 
 to a hql script?
 
 Alcaid


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



takeSample triggers 2 jobs

2015-03-06 Thread Rares Vernica
Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala sc.textFile(README.md).takeSample(false, 3)


and I notice that two jobs are generated on the Spark Jobs page:

Job Id Description
1 takeSample at console:13
0  takeSample at console:13


Any ideas why the two jobs are needed?

Thanks!
Rares


Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread sandeep vura
Hi,

For creating a Hive table, do I need to add hive-site.xml to the spark/conf
directory?

On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com
wrote:

 Its not required, but even if you don't have hive installed you probably
 still want to use the HiveContext.  From earlier in that doc:

 In addition to the basic SQLContext, you can also create a HiveContext,
 which provides a superset of the functionality provided by the basic
 SQLContext. Additional features include the ability to write queries using
 the more complete HiveQL parser, access to HiveUDFs, and the ability to
 read data from Hive tables. To use a HiveContext, *you do not need to
 have an existing Hive setup*, and all of the data sources available to a
 SQLContext are still available. HiveContext is only packaged separately to
 avoid including all of Hive’s dependencies in the default Spark build. If
 these dependencies are not a problem for your application then using
 HiveContext is recommended for the 1.2 release of Spark. Future releases
 will focus on bringing SQLContext up to feature parity with a HiveContext.


 On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

 Hi Edmon,

 No, you do not need to install Hive to use Spark SQL.

 Thanks,

 Yin

 On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon






Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Shivaram Venkataraman
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are
a good reference

Shivaram
On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Do you have a reference paper to the implemented algorithm in TSQR.scala ?

 On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

 There are couple of solvers that I've written that is part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can
 use out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao






Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Hi Todd,

Looks like the thrift server can connect to the metastore, but something is
wrong in the executors. You can try to get the log with "yarn logs
-applicationId xxx" to check why it failed. If there is no log (the master or
executor was not started at all), you can go to the RM webpage and click the
link to see why the shell failed in the first place.

Thanks.

Zhan Zhang

On Mar 6, 2015, at 9:59 AM, Todd Nist tsind...@gmail.com wrote:

First, thanks to everyone for their assistance and recommendations.

@Marcelo

I applied the patch that you recommended and am now able to get into the shell, 
thank you worked great after I realized that the pom was pointing to the 
1.3.0-SNAPSHOT for parent, need to bump that down to 1.2.1.

@Zhan

Need to apply this patch next.  I tried to start the spark-thriftserver but and 
it starts, then fails with like this:  I have the entries in my 
spark-default.conf, but not the patch applied.


./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m --hiveconf 
hive.server2.thrift.port=10001

5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at 
http://hadoopdev01.opsdatastore.com:4040
15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: 
http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at 
hadoopdev02.opsdatastore.com/192.168.15.154:8050
15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster 
with 4 NodeManagers
15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (8192 MB per container)
15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB 
memory including 384 MB overhead
15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our 
AM
15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit 
local reads feature cannot be used because libhadoop cannot be loaded.
15/03/06 12:34:19 INFO yarn.Client: Uploading resource 
file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar 
-> 
hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our 
AM container
15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application 
application_1425078697953_0018
15/03/06 12:34:22 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:22 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: 
http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
 user: root
15/03/06 12:34:23 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:24 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:25 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:26 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: ACCEPTED)
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster 
registered as 
Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]
15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> 
hadoopdev02.opsdatastore.com, PROXY_URI_BASES -> 
http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018),
 /proxy/application_1425078697953_0018
15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/03/06 12:34:27 INFO yarn.Client: Application report for 
application_1425078697953_0018 (state: RUNNING)
15/03/06 12:34:27 INFO yarn.Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: hadoopdev08.opsdatastore.com
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1425663261755
 final status: UNDEFINED
 tracking URL: 

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote:

 Can i get document how to create that setup .i mean i need hive
 integration on spark


http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables


Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Hi Zhan,

I applied the patch you recommended,
https://github.com/apache/spark/pull/3409, and it now works. It was failing
with this:

Exception message:
/hadoop/yarn/local/usercache/root/appcache/application_1425078697953_0020/container_1425078697953_0020_01_02/launch_container.sh:
line 14:
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/
*${hdp.version}*/hadoop/lib/hadoop-lzo-0.6.0.*${hdp.version}*.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*:
*bad substitution*

While the spark-defaults.conf has these defined:

spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041


Without the patch, *${hdp.version}* was not being substituted.

Thanks for pointing me to that patch, appreciate it.

-Todd

On Fri, Mar 6, 2015 at 1:12 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Hi Todd,

  Looks like the thrift server can connect to the metastore, but something
 is wrong in the executors. You can try to get the log with "yarn logs
 -applicationId xxx" to check why it failed. If there is no log (the master or
 executor is not started at all), you can go to the RM webpage and click the
 link to see why the shell failed in the first place.

  Thanks.

  Zhan Zhang

  On Mar 6, 2015, at 9:59 AM, Todd Nist tsind...@gmail.com wrote:

  First, thanks to everyone for their assistance and recommendations.

  @Marcelo

  I applied the patch that you recommended and am now able to get into the
 shell. Thank you, it worked great once I realized that the pom was pointing to
 the 1.3.0-SNAPSHOT parent and needed to be bumped down to 1.2.1.

  @Zhan

  I need to apply this patch next. I tried to start the spark-thriftserver;
 it starts, then fails like this (I have the entries in my spark-defaults.conf,
 but not the patch applied):

   ./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m 
 --hiveconf hive.server2.thrift.port=10001

  15/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
 15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
 15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050
 15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
 15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
 15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
 15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM
 15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container
 15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
 15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar -> hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar
 15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container
 15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root
 15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root
 15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
 15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager
 15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018
 15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)
 15/03/06 12:34:22 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1425663261755
  final status: UNDEFINED
  tracking URL: 
 http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/
  user: root
 15/03/06 12:34:23 INFO yarn.Client: Application report for 
 

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Sorry. Misunderstanding. Looks like it already worked. If you still hit some 
hdp.version problem, you can try it :)

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote:

You are using 1.2.1, right? If so, please add a java-opts file in the conf 
directory and give it a try.

[root@c6401 conf]# more java-opts
  -Dhdp.version=2.2.2.0-2041

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

 -Dhdp.version=2.2.0.0-2041




Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
You are using 1.2.1, right? If so, please add a java-opts file in the conf 
directory and give it a try.

[root@c6401 conf]# more java-opts
  -Dhdp.version=2.2.2.0-2041

Thanks.

Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

 -Dhdp.version=2.2.0.0-2041



Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Working great now, after applying that patch; thanks again.

On Fri, Mar 6, 2015 at 2:42 PM, Zhan Zhang zzh...@hortonworks.com wrote:

  Sorry. Misunderstanding. Looks like it already worked. If you still hit
 some hdp.version problem, you can try it :)

  Thanks.

  Zhan Zhang

  On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote:

  You are using 1.2.1, right? If so, please add a java-opts file in the conf
 directory and give it a try.

  [root@c6401 conf]# more java-opts
   -Dhdp.version=2.2.2.0-2041

  Thanks.

  Zhan Zhang

  On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

  -Dhdp.version=2.2.0.0-2041






Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert,

Would you like to register this on spark-packages.org?

Burak

On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:

 currently spark provides many excellent algorithms for operations per key
 as long as the data send to the reducers per key fits in memory. operations
 like combineByKey, reduceByKey and foldByKey rely on pushing the operation
 map-side so that the data reduce-side is small. and groupByKey simply
 requires that the values per key fit in memory.

 but there are algorithms for which we would like to process all the values
 per key reduce-side, even when they do not fit in memory. examples are
 algorithms that need to process the values ordered, or algorithms that need
 to emit all values again. basically this is what the original hadoop reduce
 operation did so well: it allowed sorting of values (using secondary sort),
 and it processed all values per key in a streaming fashion.

 the library spark-sorted aims to bring these kind of operations back to
 spark, by providing a way to process values with a user provided
 Ordering[V] and a user provided streaming operation Iterator[V] =
 Iterator[W]. it does not make the assumption that the values need to fit in
 memory per key.

 the basic idea is to rely on spark's sort-based shuffle to re-arrange the
 data so that all values for a given key are placed consecutively within a
 single partition, and then process them using a map-like operation.

 you can find the project here:
 https://github.com/tresata/spark-sorted

 the project is in a very early stage. any feedback is very much
 appreciated.
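
 To make the idea concrete, here is a minimal, hypothetical sketch (not the
 spark-sorted API) of the same trick using Spark's built-in
 repartitionAndSortWithinPartitions: partition on the natural key, sort the
 shuffle on a (key, value) composite key, then stream over each partition with
 mapPartitions. The names NaturalKeyPartitioner and SecondarySortSketch are
 illustrative.

 import org.apache.spark.{Partitioner, SparkConf, SparkContext}
 import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.2

 // Route records by the natural key only, so every value of a key lands in the
 // same partition even though the shuffle sorts on the (key, value) composite.
 class NaturalKeyPartitioner(parts: Int) extends Partitioner {
   def numPartitions: Int = parts
   def getPartition(key: Any): Int = {
     val natural = key.asInstanceOf[(String, Int)]._1
     ((natural.hashCode % parts) + parts) % parts
   }
 }

 object SecondarySortSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("SecondarySortSketch"))
     val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2), ("b", 1)))

     // Promote the value into the key; the sort-based shuffle then delivers each
     // key's values consecutively and in ascending order within a partition.
     val sorted = pairs
       .map { case (k, v) => ((k, v), ()) }
       .repartitionAndSortWithinPartitions(new NaturalKeyPartitioner(2))

     // Stream over the partition: only running state, no per-key buffering.
     val ranked = sorted.mapPartitions { iter =>
       var currentKey: String = null
       var rank = 0
       iter.map { case ((k, v), _) =>
         if (k != currentKey) { currentKey = k; rank = 0 }
         rank += 1
         (k, v, rank)   // e.g. the rank of each value within its key
       }
     }

     ranked.collect().foreach(println)
     sc.stop()
   }
 }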






Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
Only if you want to configure the connection to an existing hive metastore.

On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote:

 Hi,

 For creating a Hive table, do I need to add hive-site.xml to the spark/conf
 directory?

 On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com
 wrote:

 It's not required, but even if you don't have Hive installed you probably
 still want to use the HiveContext. From earlier in that doc:

 In addition to the basic SQLContext, you can also create a HiveContext,
 which provides a superset of the functionality provided by the basic
 SQLContext. Additional features include the ability to write queries using
 the more complete HiveQL parser, access to HiveUDFs, and the ability to
 read data from Hive tables. To use a HiveContext, *you do not need to
 have an existing Hive setup*, and all of the data sources available to
 a SQLContext are still available. HiveContext is only packaged separately
 to avoid including all of Hive’s dependencies in the default Spark build.
 If these dependencies are not a problem for your application then using
 HiveContext is recommended for the 1.2 release of Spark. Future releases
 will focus on bringing SQLContext up to feature parity with a HiveContext.


 On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

 Hi Edmon,

 No, you do not need to install Hive to use Spark SQL.

 Thanks,

 Yin

 On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

  Does Spark-SQL require installation of Hive for it to run correctly or
 not?

 I could not tell from this statement:

 https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 Thank you,
 Edmon







Re: Data Frame types

2015-03-06 Thread Michael Armbrust
No, the UDT API is not a public API, as we have not stabilized the
implementation. For this reason it's only accessible to projects inside of
Spark.

On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Hi Cesar,

 Yes, you can define a UDT with the new DataFrame, the same way that
 SchemaRDD did.

 Jaonary

 On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote:


 The SchemaRDD supports the storage of user defined classes. However, in
 order to do that, the user class needs to extend the UserDefinedType interface
 (see for example VectorUDT in org.apache.spark.mllib.linalg).

 My question is: Will the new DataFrame structure (to be released in Spark
 1.3) be able to handle user defined classes too? Will user classes need to
 extend the same interface and follow the same approach?


 --
 Cesar Flores





Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote:

 Yes, I want to link with an existing hive metastore. Is that the right way to
 link to the hive metastore?


Yes.


Re: Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Yes, this is the problem. I want to return an RDD, but it is abstract and I
cannot instantiate it. So what are the other options? I have two streams and I
want to filter one stream on the basis of the other, and also keep the value of
the other stream. I have also tried join, but one stream has more values than
the other in each sliding window, and after the join I get repetitions which I
don't want.

Regards,
Laeeq

 On Friday, March 6, 2015 8:11 PM, Sean Owen so...@cloudera.com wrote:
   

 What is this line supposed to mean?

RDD[(first,second,third)]

It's not valid as a line of code, and you don't instantiate RDDs anyway.

On Fri, Mar 6, 2015 at 7:06 PM, Laeeq Ahmed
laeeqsp...@yahoo.com.invalid wrote:
 Hi,

 I am filtering first DStream with the value in second DStream. I also want
 to keep the value of second Dstream. I have done the following and having
 problem with returning new RDD:

 val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
 RDD[(String,String)], rdd2 : RDD[Int]) => {
   var first = ""; var second = ""; var third = 0
   if (rdd2.first >= 3)
   {
     first = rdd1.map(_._1).first
     second = rdd1.map(_._2).first
     third = rdd2.first
   }
   RDD[(first,second,third)]
 })

 ERROR
 /home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119:
 error: not found: value RDD
 [ERROR] RDD[(first,second,third)]

 I have imported org.apache.spark.rdd.RDD.


 Regards,
 Laeeq
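
 One way to do what is being asked without constructing an RDD by hand is to
 build the result from rdd1 inside the transformWith function and return that
 RDD. A hedged sketch follows; the stream names mirror the thread, while
 tagWithAnomaly and the threshold 3 are illustrative.

 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.dstream.DStream

 // Keep the value of the second stream while tagging the first, and return a
 // real RDD derived from rdd1 instead of trying to instantiate one.
 def tagWithAnomaly(
     fileAndTime: DStream[(String, String)],
     anomaly: DStream[Int]): DStream[(String, String, Int)] = {
   fileAndTime.transformWith(anomaly, (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
     // take(1) avoids the exception that rdd2.first() throws on an empty batch
     val level = rdd2.take(1).headOption.getOrElse(0)
     if (level >= 3) {
       rdd1.map { case (file, time) => (file, time, level) }
     } else {
       rdd1.sparkContext.parallelize(Seq.empty[(String, String, Int)])
     }
   })
 }

 If the goal is really a join without repetitions, another option is to reduce
 the second stream to a single value per window (for example with a max) before
 joining.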



   

Re: Spark Streaming Switchover Time

2015-03-06 Thread Tathagata Das
It is probably the time taken by the system to figure out that the worker
is down. Could you see in the logs to find what goes on when you kill the
worker?

TD

On Wed, Mar 4, 2015 at 6:20 AM, Nastooh Avessta (navesta) nave...@cisco.com
 wrote:

  Indeed. And am wondering if this switchover time can be decreased.

 Cheers,






 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* Tuesday, March 03, 2015 11:11 PM

 *To:* Nastooh Avessta (navesta)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Spark Streaming Switchover Time



 I am confused. Are you killing the 1st worker node to see whether the
 system restarts the receiver on the second worker?



 TD



 On Tue, Mar 3, 2015 at 10:49 PM, Nastooh Avessta (navesta) 
 nave...@cisco.com wrote:

 This is the time it takes for the driver to start receiving data once again,
 from the 2nd worker, after the 1st worker, where the streaming thread was
 initially running, is shut down.

 Cheers,






 *From:* Tathagata Das [mailto:t...@databricks.com]
 *Sent:* Tuesday, March 03, 2015 10:24 PM
 *To:* Nastooh Avessta (navesta)
 *Cc:* user@spark.apache.org
 *Subject:* Re: Spark Streaming Switchover Time



 Can you elaborate on what is this switchover time?



 TD



 On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) 
 nave...@cisco.com wrote:

 Hi

 On a standalone Spark 1.0.0 cluster, with 1 master and 2 workers, operating in
 client mode and running a UDP streaming application, I am seeing around a
 2-second elapsed time on switchover when shutting down the streaming worker,
 where the streaming window length is 1 sec. I am wondering what parameters are
 available to the developer to shorten this switchover time.

 Cheers,




Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Night Wolf
Tried with that. No luck. Same error on abt-interface jar. I can see maven
downloaded that jar into my .m2 cache

On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:

 try it with mvn  -DskipTests -Pscala-2.11 clean install package


Store the shuffled files in memory using Tachyon

2015-03-06 Thread sara mustafa
Hi all,

Is it possible to store Spark shuffled files on a distributed memory like
Tachyon instead of spilling them to disk?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Store-the-shuffled-files-in-memory-using-Tachyon-tp21944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-stream programme failed on yarn-client

2015-03-06 Thread Akhil Das
Looks like an issue with your yarn setup, could you try doing a simple
example with spark-shell?

Start the spark shell as:

$ MASTER=yarn-client bin/spark-shell
spark-shell> sc.parallelize(1 to 1000).collect

If that doesn't work, then make sure your yarn services are up and running,
and in your spark-env.sh you may set the corresponding configurations from
the following:


# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512
Mb)


Thanks
Best Regards

On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong 980548...@qq.com wrote:

 Hi all,


 I'm trying to write a Spark Streaming programme, so I read the Spark
 online documentation; following the document, I wrote the programme like this:

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.StreamingContext._

 object SparkStreamTest {
   def main(args: Array[String]) {
     val conf = new SparkConf()
     val ssc = new StreamingContext(conf, Seconds(1))
     val lines = ssc.socketTextStream(args(0), args(1).toInt)
     val words = lines.flatMap(_.split(" "))
     val pairs = words.map(word => (word, 1))
     val wordCounts = pairs.reduceByKey(_ + _)
     wordCounts.print()
     ssc.start()             // Start the computation
     ssc.awaitTermination()  // Wait for the computation to terminate
   }
 }



 For testing, I first start listening on a port with:
  nc -lk 

 and then I submit the job with:
 spark-submit  --master local[2] --class com.nd.hxf.SparkStreamTest
 spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 

 everything is okay


 but when I run it on YARN with:
 spark-submit  --master yarn-client --class com.nd.hxf.SparkStreamTest
 spark-test-tream-1.0-SNAPSHOT-job.jar  localhost 

 it waits for a long time and repeatedly outputs some messages; part of the
 output looks like this:







 15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after waiting
 maxRegisteredResourcesWaitingTime: 3(ms)
 15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
 15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
 15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO SocketInputDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated
 org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
 15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO FlatMappedDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated
 org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
 15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false,
 false, false, false, 1)
 15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO MappedDStream: Initialized and validated
 org.apache.spark.streaming.dstream.MappedDStream@941451f
 15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO ShuffledDStream: Storage level =
 StorageLevel(false, false, false, false, 1)
 15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated
 org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee
 15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms
 15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false,
 false, false, false, 1)
 15/03/06 15:30:24 INFO ForEachDStream: Checkpoint interval = null
 15/03/06 15:30:24 INFO ForEachDStream: Remember duration = 1000 ms
 15/03/06 15:30:24 INFO ForEachDStream: Initialized and validated
 org.apache.spark.streaming.dstream.ForEachDStream@48d166b5
 15/03/06 15:30:24 INFO SparkContext: Starting job: start at
 SparkStreamTest.scala:21
 15/03/06 15:30:24 INFO