HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the code: val conf = new SparkConf().setAppName("HiveSparkIntegrationTest") conf.set("spark.executor.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")

Re: HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread Marcelo Vanzin
On Fri, Mar 6, 2015 at 2:47 PM, nitinkak001 nitinkak...@gmail.com wrote: I am trying to run a Hive query from Spark using HiveContext. Here is the code: val conf = new SparkConf().setAppName("HiveSparkIntegrationTest") conf.set("spark.executor.extraClassPath",

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
i added it On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote: Hi Koert, Would you like to register this on spark-packages.org? Burak On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote: currently spark provides many excellent algorithms for operations

Spark streaming and executor object reusage

2015-03-06 Thread Jean-Pascal Billaud
Hi, Reading through the Spark Streaming Programming Guide, I read in the Design Patterns for using foreachRDD: Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of
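A minimal sketch of the pattern described there, assuming a hypothetical ConnectionPool helper (the guide leaves the pool implementation to the user):
```
// Sketch only: ConnectionPool stands in for any statically held, lazily
// initialized pool living in the executor JVM, so connections survive batches.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection() // reused across RDDs/batches
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)     // return for future reuse
  }
}
```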

Re: Spark code development practice

2015-03-06 Thread Sean Owen
Hm, why do you expect a factory method over a constructor? No, you instantiate a SparkContext (if not working in the shell). When you write your own program, you parse your own command line args. --master yarn-client doesn't do anything unless you make it do so. That is an arg to *Spark*
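A minimal sketch of what Sean describes, with illustrative names: you construct the SparkContext yourself and decide what to do with args like --master:
```
import org.apache.spark.{SparkConf, SparkContext}

// No factory method: instantiate the context directly. Making a
// "--master yarn-client" argument take effect is your program's job,
// e.g. by feeding it into setMaster.
val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)
```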

Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread fenghaixiong
You can read this document: http://spark.apache.org/docs/latest/building-spark.html. It might solve your question. Also, if you compile Spark with Maven you may need to set the Maven options like this before you start compiling: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M"
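For reference, the building-spark page linked above suggests roughly the following (exact values vary between Spark versions):
```
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -DskipTests clean package
```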

Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Sean Owen
Are you letting Spark download and run zinc for you? Maybe that copy is incomplete or corrupted. You can try removing the downloaded zinc from build/ and trying again, or run your own zinc. On Fri, Mar 6, 2015 at 7:51 AM, Night Wolf nightwolf...@gmail.com wrote: Hey, Trying to build latest spark

Re: Integer column in schema RDD from parquet being considered as string

2015-03-06 Thread gtinside
Hi tsingfu, Thanks for your reply. I tried with other columns but the problem is the same with other Integer columns. Regards, Gaurav

Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Found this thread: http://search-hadoop.com/m/JW1q5HMrge2 Cheers On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote: This was discussed in the past and viewed as dangerous to enable. The biggest problem, by far, comes when you have a job that outputs M partitions,

Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Nan Zhu
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable output validation for the whole program, the Spark implementation internally uses a dynamic variable (in object PairRDDFunctions) to disable it on a case-by-case basis: val disableOutputSpecValidation:
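A sketch of the program-wide switch mentioned above (the dynamic variable itself is internal, so user code would normally rely on this config):
```
import org.apache.spark.{SparkConf, SparkContext}

// Disables Hadoop output-spec validation for the whole application, letting
// saveAs*File write into an existing directory. Risky, per Sean's caveats.
val conf = new SparkConf()
  .setAppName("OverwriteExample")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
```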

Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-06 Thread Zsolt Tóth
Hi, I submit spark jobs in yarn-cluster mode remotely from java code by calling Client.submitApplication(). For some reason I want to use 1.3.0 jars on the client side (e.g. spark-yarn_2.10-1.3.0.jar) but I have spark-assembly-1.2.1* on the cluster. The problem is that the ApplicationMaster can't

Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Edmon Begoli
Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive Thank you, Edmon

Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Sean Owen
This was discussed in the past and viewed as dangerous to enable. The biggest problem, by far, comes when you have a job that outputs M partitions, 'overwriting' a directory of data containing N > M old partitions. You suddenly have a mix of new and old data. It doesn't match Hadoop's semantics

Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Adding support for an overwrite flag would make saveAsXXFile more user-friendly. Cheers On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote: Hi folks, I found that RDD.saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this? --

Data Frame types

2015-03-06 Thread Cesar Flores
The SchemaRDD supports the storage of user defined classes. However, in order to do that, the user class needs to extend the UserDefinedType interface (see for example VectorUDT in org.apache.spark.mllib.linalg). My question is: will the new DataFrame structure (to be released in Spark 1.3)

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon, No, you do not need to install Hive to use Spark SQL. Thanks, Yin On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote: Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement:

[SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread James
Hello, I want to execute an hql script through the `spark-sql` command; my script contains: ``` ALTER TABLE xxx DROP PARTITION (date_key = ${hiveconf:CUR_DATE}); ``` When I execute ``` spark-sql -f script.hql -hiveconf CUR_DATE=20150119 ``` it throws an error like ``` cannot recognize input near

Re: LBGFS optimizer performace

2015-03-06 Thread Gustavo Enrique Salazar Torres
Hi there: Yeah, I came to that same conclusion after tuning the Spark SQL shuffle parameter. I also cut out some classes I was using to parse my dataset and finally created the schema with only the fields needed for my model (before that I was creating it with 63 fields while I just needed 15). So I came

Re: Optimizing SQL Query

2015-03-06 Thread daniel queiroz
Dude, please, attach the execution plan of the query and details about the indexes. 2015-03-06 9:07 GMT-03:00 anu anamika.guo...@gmail.com: I have a query that's like: Could you help in providing me pointers as to how to start to optimize it w.r.t. spark sql: sqlContext.sql( SELECT
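One way to produce the plan Daniel asks for, sketched against the Spark 1.2.x API (sql() returns a SchemaRDD, which exposes its query execution):
```
// Prints the parsed, analyzed, optimized and physical plans for the query.
val result = sqlContext.sql("SELECT ...") // the full query from the post
println(result.queryExecution)
```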

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Do you have a reference paper for the algorithm implemented in TSQR.scala? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet

spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
currently spark provides many excellent algorithms for operations per key as long as the data sent to the reducers per key fits in memory. operations like combineByKey, reduceByKey and foldByKey rely on pushing the operation map-side so that the data reduce-side is small. and groupByKey simply
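A small illustration of the contrast Koert draws, with made-up data: reduceByKey combines map-side so only small per-key results reach the reducers, while groupByKey ships every value:
```
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Map-side combine: reducers see at most one partial sum per key per partition.
val sums = pairs.reduceByKey(_ + _)

// No map-side combine: all values for a key must fit in memory reduce-side.
val grouped = pairs.groupByKey()
```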

Re: No overwrite flag for saveAsXXFile

2015-03-06 Thread Ted Yu
Since we already have the spark.hadoop.validateOutputSpecs config, I think there is not much need to expose disableOutputSpecValidation. Cheers On Fri, Mar 6, 2015 at 7:34 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable

Re: Data Frame types

2015-03-06 Thread Jaonary Rabarisoa
Hi Cesar, Yes, you can define a UDT with the new DataFrame, the same way that SchemaRDD did. Jaonary On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote: The SchemaRDD supports the storage of user defined classes. However, in order to do that, the user class needs to

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Jaonary Rabarisoa
Hi Shivaram, Thank you for the link. I'm trying to figure out how I can port this to MLlib. Maybe you can help me understand how the pieces fit together. Currently, in MLlib there are different types of distributed matrix: BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one

Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-06 Thread Sean Owen
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile. Why are you running install package and not just install? Probably doesn't matter. This sounds like you are trying to only build core without building everything else, which you can't do in general unless you

Optimizing SQL Query

2015-03-06 Thread anu
I have a query that's like the following. Could you help in providing me pointers as to how to start to optimize it w.r.t. Spark SQL: sqlContext.sql("SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE FROM ( SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE

Re: spark-stream programme failed on yarn-client

2015-03-06 Thread fenghaixiong
Thanks, your advice is useful. I had submitted my job from my spark client, which is configured with a simple configuration file, so it failed; when I run my job on the service machine everything is okay. On Fri, Mar 06, 2015 at 02:10:04PM +0530, Akhil Das wrote: Looks like an issue with your yarn setup, could you

No overwrite flag for saveAsXXFile

2015-03-06 Thread Jeff Zhang
Hi folks, I found that RDD.saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this? -- Best Regards Jeff Zhang

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
It's not required, but even if you don't have Hive installed you probably still want to use the HiveContext. From earlier in that doc: In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext.
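Following the doc Michael quotes, creating a HiveContext needs no Hive installation, for example:
```
import org.apache.spark.sql.hive.HiveContext

// Works without an existing Hive deployment; when no hive-site.xml is
// present, a local metastore is created automatically.
val hiveContext = new HiveContext(sc)
```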

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
First, thanks to everyone for their assistance and recommendations. @Marcelo I applied the patch that you recommended and am now able to get into the shell; thank you, it worked great after I realized that the pom was pointing to the 1.3.0-SNAPSHOT for the parent and needed to be bumped down to 1.2.1. @Zhan

SparkSQL supports hive insert overwrite directory?

2015-03-06 Thread ogoh
Hello, I am using Spark 1.2.1 along with Hive 0.13.1. I run some hive queries using beeline and the Thrift server. The queries I tested so far worked well except the following: I want to export the query output into a file on either HDFS or the local fs (ideally the local fs). Is this not yet supported? The

Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Hi, I am filtering the first DStream with the value in the second DStream, and I also want to keep the value of the second DStream. I have done the following and am having a problem returning a new RDD: val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String,String)], rdd2: RDD[Int])

Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares, If you dig into the descriptions for the two jobs, it will probably return something like: Job ID: 1 org.apache.spark.rdd.RDD.takeSample(RDD.scala:447) $line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22) ... Job ID: 0
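Roughly, the two jobs correspond to the two passes takeSample makes; a sketch of the equivalent steps (an assumption about the 1.2.x internals, since takeSample first needs the RDD's size):
```
val rdd = sc.textFile("README.md")
val total = rdd.count()                    // first job: determine the RDD size
val sample = rdd
  .sample(withReplacement = false, 3.0 / total)
  .collect()                               // second job: materialize the sample
```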

Re: Visualize Spark Job

2015-03-06 Thread Phuoc Do
I have submitted this PR. You can merge it and try it: https://github.com/apache/spark/pull/2077 On Thu, Jan 15, 2015 at 12:50 AM, Kuromatsu, Nobuyuki n.kuroma...@jp.fujitsu.com wrote: Hi I want to visualize tasks and stages in order to analyze spark jobs. I know the necessary metrics are written

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread Zhan Zhang
Do you mean "--hiveconf" (two dashes) instead of "-hiveconf" (one dash)? Thanks. Zhan Zhang On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote: Hello, I want to execute an hql script through the `spark-sql` command; my script contains: ``` ALTER TABLE xxx DROP PARTITION
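If Zhan's hunch is right, the invocation would use the standard Hive CLI form of the flag (two dashes):
```
spark-sql -f script.hql --hiveconf CUR_DATE=20150119
```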

takeSample triggers 2 jobs

2015-03-06 Thread Rares Vernica
Hello, I am using takeSample from the Scala Spark 1.2.1 shell: scala> sc.textFile("README.md").takeSample(false, 3) and I notice that two jobs are generated on the Spark Jobs page: Job Id Description 1 takeSample at <console>:13 0 takeSample at <console>:13 Any ideas why the two jobs are needed?

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread sandeep vura
Hi, for creating a Hive table do I need to add hive-site.xml to the spark/conf directory? On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com wrote: It's not required, but even if you don't have Hive installed you probably still want to use the HiveContext. From earlier in

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-06 Thread Shivaram Venkataraman
Sections 3, 4, and 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf are a good reference. Shivaram On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Do you have a reference paper for the algorithm implemented in TSQR.scala? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Hi Todd, Looks like the thrift server can connect to the metastore, but something is wrong in the executors. You can try to get the log with "yarn logs -applicationId xxx" to check why it failed. If there is no log (master or executor is not started at all), you can go to the RM webpage, click the

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote: Can I get a document on how to create that setup? I mean, I need Hive integration on Spark. http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Hi Zhan, I applied the patch you recommended, https://github.com/apache/spark/pull/3409, and it now works. It was failing with this: Exception message: /hadoop/yarn/local/usercache/root/appcache/application_1425078697953_0020/container_1425078697953_0020_01_02/launch_container.sh: line 14:

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
Sorry, I misunderstood. Looks like it already worked. If you still meet some hdp.version problem, you can try it :) Thanks. Zhan Zhang On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote: You are using 1.2.1, right? If so, please add a java-opts file in

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Zhan Zhang
You are using 1.2.1, right? If so, please add a java-opts file in the conf directory and give it a try. [root@c6401 conf]# more java-opts -Dhdp.version=2.2.2.0-2041 Thanks. Zhan Zhang On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-06 Thread Todd Nist
Working great now, after applying that patch; thanks again. On Fri, Mar 6, 2015 at 2:42 PM, Zhan Zhang zzh...@hortonworks.com wrote: Sorry, I misunderstood. Looks like it already worked. If you still meet some hdp.version problem, you can try it :) Thanks. Zhan Zhang On Mar 6, 2015,

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert, Would you like to register this on spark-packages.org? Burak On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote: currently spark provides many excellent algorithms for operations per key as long as the data sent to the reducers per key fits in memory. operations

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
Only if you want to configure the connection to an existing hive metastore. On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote: Hi, for creating a Hive table do I need to add hive-site.xml to the spark/conf directory? On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust
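A minimal hive-site.xml sketch for pointing Spark at an existing metastore (host and port are placeholders):
```
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```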

Re: Data Frame types

2015-03-06 Thread Michael Armbrust
No, the UDT API is not a public API, as we have not stabilized the implementation. For this reason it's only accessible to projects inside of Spark. On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi Cesar, Yes, you can define a UDT with the new DataFrame, the same

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Michael Armbrust
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote: Yes, I want to link with an existing hive metastore. Is that the right way to link to the hive metastore? Yes.

Re: Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Yes, this is the problem. I want to return an RDD, but it is abstract and I cannot instantiate it. So what are the other options? I have two streams and I want to filter one stream on the basis of the other, and I also want to keep the value of the other stream. I have also tried join, but one stream has more
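One option that avoids instantiating an RDD directly: build the returned RDD from the two inputs inside transformWith, e.g. via cartesian, so the second stream's value is kept alongside each record (a sketch; the filter predicate is a placeholder):
```
import org.apache.spark.rdd.RDD

// Pairs every (file, time) record with the anomaly value from the same batch,
// keeping the second stream's value, then filters on it.
val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) =>
    rdd1.cartesian(rdd2).filter { case ((file, time), level) => level > 0 })
```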

Re: Spark Streaming Switchover Time

2015-03-06 Thread Tathagata Das
It is probably the time taken by the system to figure out that the worker is down. Could you check the logs to see what goes on when you kill the worker? TD On Wed, Mar 4, 2015 at 6:20 AM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Indeed. And am wondering if this switchover time

Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-06 Thread Night Wolf
Tried with that. No luck. Same error on the sbt-interface jar. I can see Maven downloaded that jar into my .m2 cache. On Friday, March 6, 2015, 鹰 980548...@qq.com wrote: try it with mvn -DskipTests -Pscala-2.11 clean install package

Store the shuffled files in memory using Tachyon

2015-03-06 Thread sara mustafa
Hi all, Is it possible to store Spark shuffled files on a distributed memory like Tachyon instead of spilling them to disk?

Re: spark-stream programme failed on yarn-client

2015-03-06 Thread Akhil Das
Looks like an issue with your yarn setup. Could you try doing a simple example with spark-shell? Start the spark shell as: $ MASTER=yarn-client bin/spark-shell and then run: sc.parallelize(1 to 1000).collect If that doesn't work, then make sure your yarn services are up and running and in