HiveContext test, Spark Context did not initialize after waiting 10000ms
I am trying to run a Hive query from Spark using HiveContext. Here is the code:

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
conf.set("spark.executor.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.driver.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.yarn.am.waitTime", "30L")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
def inputRDD = sqlContext.sql("describe spark_poc.src_digital_profile_user")
inputRDD.collect().foreach { println }
println(inputRDD.schema.getClass.getName)

I am getting the exception below. Any clues? The weird part is that if I try to do the same thing in Java instead of Scala, it runs fine.

Exception in thread "Driver" java.lang.NullPointerException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
15/03/06 17:39:32 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 10000 ms. Please check earlier log output for errors. Failing the application.
Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:218)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:434)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:433)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
15/03/06 17:39:32 INFO yarn.ApplicationMaster: AppMaster received a signal.
Re: HiveContext test, Spark Context did not initialize after waiting 10000ms
On Fri, Mar 6, 2015 at 2:47 PM, nitinkak001 nitinkak...@gmail.com wrote:

I am trying to run a Hive query from Spark using HiveContext. Here is the code:

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
conf.set("spark.executor.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.driver.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.yarn.am.waitTime", "30L")

You're missing /* at the end of your classpath entries. Also, since you're on CDH 5.2, you'll probably need to filter out the guava jar from Hive's lib directory, otherwise things might break, so things will get a little more complicated. With CDH 5.3 you shouldn't need to filter out the guava jar.

--
Marcelo
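Putting Marcelo's fix back into the original snippet would look roughly like this (a sketch only: the CDH path comes from the original mail, the trailing /* is the fix being suggested, and the waitTime value is an arbitrary example, since conf values are plain strings):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
// Append /* so the JVM picks up every jar in Hive's lib directory.
conf.set("spark.executor.extraClassPath",
  "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/*")
conf.set("spark.driver.extraClassPath",
  "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/*")
// Conf values are strings; this wait time is an arbitrary illustrative value.
conf.set("spark.yarn.am.waitTime", "100000")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("describe spark_poc.src_digital_profile_user").collect().foreach(println)
```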
Re: spark-sorted, or secondary sort and streaming reduce for spark
i added it On Fri, Mar 6, 2015 at 2:40 PM, Burak Yavuz brk...@gmail.com wrote: Hi Koert, Would you like to register this on spark-packages.org? Burak On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote: currently spark provides many excellent algorithms for operations per key as long as the data send to the reducers per key fits in memory. operations like combineByKey, reduceByKey and foldByKey rely on pushing the operation map-side so that the data reduce-side is small. and groupByKey simply requires that the values per key fit in memory. but there are algorithms for which we would like to process all the values per key reduce-side, even when they do not fit in memory. examples are algorithms that need to process the values ordered, or algorithms that need to emit all values again. basically this is what the original hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. the library spark-sorted aims to bring these kind of operations back to spark, by providing a way to process values with a user provided Ordering[V] and a user provided streaming operation Iterator[V] = Iterator[W]. it does not make the assumption that the values need to fit in memory per key. the basic idea is to rely on spark's sort-based shuffle to re-arrange the data so that all values for a given key are placed consecutively within a single partition, and then process them using a map-like operation. you can find the project here: https://github.com/tresata/spark-sorted the project is in a very early stage. any feedback is very much appreciated.
Spark streaming and executor object reusage
Hi,

Reading through the Spark Streaming Programming Guide, I read this in the "Design Patterns for using foreachRDD" section: "Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of multiple batches are pushed to the external system."

I have a connection pool that might be more or less heavy to instantiate. I don't use it as part of a foreachRDD but as part of regular map operations to query some API service.

I'd like to understand what "multiple batches" means here. Is this across RDDs on a single DStream? Across multiple DStreams? I'd like to understand the context shareability across DStreams over time. Is it expected that the executor initializing my factory will keep getting batches from my streaming job while using the same singleton connection pool over and over? Or does Spark reset executor state after each DStream is completed, in order to allocate executors to other streaming jobs?

Thanks,
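For what it's worth, the usual way to get one pool per executor JVM (shared across tasks and batches scheduled on that executor, for as long as the executor lives) is a lazily initialized singleton object. This is only a sketch; ConnectionPool and queryApi are hypothetical stand-ins for whatever client library is actually used:

```
import org.apache.spark.streaming.dstream.DStream

// Hypothetical pool type standing in for the real client library.
class ConnectionPool { def queryApi(key: String): String = ??? }

object PoolHolder {
  // Initialized once per executor JVM, the first time a task touches it,
  // and reused by every subsequent task that runs in that JVM.
  lazy val pool: ConnectionPool = new ConnectionPool
}

def enrich(stream: DStream[String]): DStream[(String, String)] =
  stream.mapPartitions { records =>
    val pool = PoolHolder.pool // same instance across batches on this executor
    records.map(r => (r, pool.queryApi(r)))
  }
```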
Re: Spark code development practice
Hm, why do you expect a factory method over a constructor? No, you instantiate a SparkContext (if not working in the shell). When you write your own program, you parse your own command line args. "--master yarn-client" doesn't do anything unless you make it do so; that is an arg to *Spark* programs. In many cases these just end up setting a value on SparkConf. Not always, though: you would need to trace through programs like SparkSubmit to see how to emulate the effect of some args. Unless you really need to do this, you should not. Write an app that you run with spark-submit.

On Fri, Mar 6, 2015 at 2:56 AM, Xi Shen davidshe...@gmail.com wrote:

Thanks guys, this is very useful :) @Stephen, I know spark-shell will create a SC for me. But I don't understand why we still need to do new SparkContext(...) in our code. Shouldn't we get it from somewhere, e.g. SparkContext.get? Another question: if I want my Spark code to run in YARN later, how should I create the SparkContext? Or can I just specify --master yarn on the command line? Thanks, David

On Fri, Mar 6, 2015 at 12:38 PM Koen Vantomme koen.vanto...@gmail.com wrote:

Use the spark-shell command and the shell will open. Type :paste and then paste your code, followed by Ctrl-D. Open spark-shell with: sparks/bin ./spark-shell. [Sent from my iPhone]

On 6 Mar 2015 at 02:28, fightf...@163.com fightf...@163.com wrote:

Hi, You can first establish a Scala IDE to develop and debug your Spark program, let's say IntelliJ IDEA or Eclipse. Thanks, Sun. fightf...@163.com

From: Xi Shen
Date: 2015-03-06 09:19
To: user@spark.apache.org
Subject: Spark code development practice

Hi, I am new to Spark. I see every Spark program has a main() function. I wonder if I can run the Spark program directly, without using spark-submit. I think it will be easier for early development and debugging. Thanks, David
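As a concrete illustration of that last point, a minimal self-contained app creates its SparkContext in main() and leaves the master to spark-submit; the object name, app name, and word-count logic below are just an example:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // No master hard-coded here: spark-submit --master local[*] / yarn-client supplies it.
    val conf = new SparkConf().setAppName("WordCountApp")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```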
Re: Compile Spark with Maven Zinc Scala Plugin
You can read this document: http://spark.apache.org/docs/latest/building-spark.html. It might solve your question. Also, if you compile Spark with Maven you may need to set the Maven options like this before you start compiling:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

thanks

On Fri, Mar 06, 2015 at 07:10:34PM +1100, Night Wolf wrote:

Tried with that. No luck. Same error on the sbt-interface jar. I can see Maven downloaded that jar into my .m2 cache.

On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:

try it with mvn -DskipTests -Pscala-2.11 clean install package
Re: Compile Spark with Maven Zinc Scala Plugin
Are you letting Spark download and run zinc for you? Maybe that copy is incomplete or corrupted. You can try removing the downloaded zinc from build/ and trying again. Or run your own zinc.

On Fri, Mar 6, 2015 at 7:51 AM, Night Wolf nightwolf...@gmail.com wrote:

Hey, trying to build latest Spark 1.3 with Maven using -DskipTests clean install package. But I'm getting errors with zinc; in the logs I see:

[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-network-common_2.11 ---
...
[error] Required file not found: sbt-interface.jar
[error] See zinc -help for information about locating necessary files

Any ideas?
Re: Integer column in schema RDD from parquet being considered as string
Hi tsingfu,

Thanks for your reply. I tried with other columns, but the problem is the same with the other Integer columns.

Regards,
Gaurav
Re: No overwrite flag for saveAsXXFile
Found this thread: http://search-hadoop.com/m/JW1q5HMrge2

Cheers

On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

This was discussed in the past and viewed as dangerous to enable. The biggest problem, by far, comes when you have a job that outputs M partitions, 'overwriting' a directory of data containing N > M old partitions. You suddenly have a mix of new and old data. It doesn't match Hadoop's semantics either, which won't let you do this. You can of course simply remove the output directory.

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:

Adding support for an overwrite flag would make saveAsXXFile more user friendly. Cheers

On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

Hi folks, I found that RDD:saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this?

--
Best Regards
Jeff Zhang
Re: No overwrite flag for saveAsXXFile
Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable output validation for the whole program, the Spark implementation uses a DynamicVariable (in object PairRDDFunctions) internally to disable it in a case-by-case manner:

val disableOutputSpecValidation: DynamicVariable[Boolean] = new DynamicVariable[Boolean](false)

I'm not sure there is enough benefit to make it worth exposing this variable to the user...

Best,

--
Nan Zhu
http://codingcat.me

On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

Found this thread: http://search-hadoop.com/m/JW1q5HMrge2 Cheers

On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

This was discussed in the past and viewed as dangerous to enable. The biggest problem, by far, comes when you have a job that outputs M partitions, 'overwriting' a directory of data containing N > M old partitions. You suddenly have a mix of new and old data. It doesn't match Hadoop's semantics either, which won't let you do this. You can of course simply remove the output directory.

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:

Adding support for an overwrite flag would make saveAsXXFile more user friendly. Cheers

On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

Hi folks, I found that RDD:saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this?

--
Best Regards
Jeff Zhang
Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode
Hi, I submit spark jobs in yarn-cluster mode remotely from java code by calling Client.submitApplication(). For some reason I want to use 1.3.0 jars on the client side (e.g spark-yarn_2.10-1.3.0.jar) but I have spark-assembly-1.2.1* on the cluster. The problem is that the ApplicationMaster can't find the user application jar (specified with --jar option). I think this is because of changes in the classpath population in the Client class. Is it possible to make this setup work without changing the codebase or the jars? Cheers, Zsolt
Spark-SQL and Hive - is Hive required?
Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive Thank you, Edmon
Re: No overwrite flag for saveAsXXFile
This was discussed in the past and viewed as dangerous to enable. The biggest problem, by far, comes when you have a job that outputs M partitions, 'overwriting' a directory of data containing N > M old partitions. You suddenly have a mix of new and old data. It doesn't match Hadoop's semantics either, which won't let you do this. You can of course simply remove the output directory.

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:

Adding support for an overwrite flag would make saveAsXXFile more user friendly. Cheers

On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

Hi folks, I found that RDD:saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this?

--
Best Regards
Jeff Zhang
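For completeness, "simply remove the output directory" can be done from the driver with the Hadoop FileSystem API right before saving; a sketch (the helper name and output path are hypothetical, and the recursive delete is deliberately explicit rather than hidden behind an overwrite flag):

```
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def saveOverwriting(sc: SparkContext, rdd: RDD[String], out: String): Unit = {
  val path = new Path(out)
  val fs: FileSystem = path.getFileSystem(sc.hadoopConfiguration)
  // Explicit, deliberate delete of any previous output before writing new data.
  if (fs.exists(path)) fs.delete(path, true)
  rdd.saveAsTextFile(out)
}
```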
Re: No overwrite flag for saveAsXXFile
Adding support for an overwrite flag would make saveAsXXFile more user friendly.

Cheers

On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

Hi folks, I found that RDD:saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this?

--
Best Regards
Jeff Zhang
Data Frame types
The SchemaRDD supports the storage of user defined classes. However, in order to do that, the user class needs to extend the UserDefinedType interface (see for example VectorUDT in org.apache.spark.mllib.linalg). My question is: will the new DataFrame structure (to be released in Spark 1.3) be able to handle user defined classes too? Will user classes need to extend UserDefinedType in the same way?

--
Cesar Flores
Re: Spark-SQL and Hive - is Hive required?
Hi Edmon, No, you do not need to install Hive to use Spark SQL. Thanks, Yin On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote: Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive Thank you, Edmon
[SPARK-SQL] How to pass parameter when running hql script using cli?
Hello, I want to execute a hql script through `spark-sql` command, my script contains: ``` ALTER TABLE xxx DROP PARTITION (date_key = ${hiveconf:CUR_DATE}); ``` when I execute ``` spark-sql -f script.hql -hiveconf CUR_DATE=20150119 ``` It throws an error like ``` cannot recognize input near '$' '{' 'hiveconf' in constant ``` I have try on hive and it works. Thus how could I pass a parameter like date to a hql script? Alcaid
Re: LBGFS optimizer performace
Hi there: Yeah, I came to that same conclusion after tuning spark sql shuffle parameter. Also cut out some classes I was using to parse my dataset and finally created schema only with the fields needed for my model (before that I was creating it with 63 fields while I just needed 15). So I came with this set of parameters: --num-executors 200 --executor-memory 800M --conf spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+AggressiveOpts --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.storage.memoryFraction=0.3 --conf spark.rdd.compress=true --conf spark.sql.shuffle.partitions=4000 --driver-memory 4G Now I processed 270 GB in 35 minutes and no OOM errors. I have one question though: Does Spark SQL handle skewed tables? I was wondering about that since my data has that feature and maybe there is more room for performance improvement. Thanks again. Gustavo On Thu, Mar 5, 2015 at 6:45 PM, DB Tsai dbt...@dbtsai.com wrote: PS, I will recommend you compress the data when you cache the RDD. There will be some overhead in compression/decompression, and serialization/deserialization, but it will help a lot for iterative algorithms with ability to caching more data. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, Mar 3, 2015 at 2:27 PM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Yeah, I can call count before that and it works. Also I was over caching tables but I removed those. Now there is no caching but it gets really slow since it calculates my table RDD many times. Also hacked the LBFGS code to pass the number of examples which I calculated outside in a Spark SQL query but just moved the location of the problem. The query I'm running looks like this: sSELECT $mappedFields, tableB.id as b_id FROM tableA LEFT JOIN tableB ON tableA.id=tableB.table_a_id WHERE tableA.status='' OR tableB.status='' mappedFields contains a list of fields which I'm interested in. The result of that query goes through (including sampling) some transformations before being input to LBFGS. My dataset has 180GB just for feature selection, I'm planning to use 450GB to train the final model and I'm using 16 c3.2xlarge EC2 instances, that means I have 240GB of RAM available. Any suggestion? I'm starting to check the algorithm because I don't understand why it needs to count the dataset. Thanks Gustavo On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley jos...@databricks.com wrote: Is that error actually occurring in LBFGS? It looks like it might be happening before the data even gets to LBFGS. (Perhaps the outer join you're trying to do is making the dataset size explode a bit.) Are you able to call count() (or any RDD action) on the data before you pass it to LBFGS? On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Just did with the same error. I think the problem is the data.count() call in LBFGS because for huge datasets that's naive to do. I was thinking to write my version of LBFGS but instead of doing data.count() I will pass that parameter which I will calculate from a Spark SQL query. I will let you know. Thanks On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try increasing your driver memory, reducing the executors and increasing the executor memory? Thanks Best Regards On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Hi there: I'm using LBFGS optimizer to train a logistic regression model. 
The code I implemented follows the pattern showed in https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but training data is obtained from a Spark SQL RDD. The problem I'm having is that LBFGS tries to count the elements in my RDD and that results in a OOM exception since my dataset is huge. I'm running on a AWS EMR cluster with 16 c3.2xlarge instances on Hadoop YARN. My dataset is about 150 GB but I sample (I take only 1% of the data) it in order to scale logistic regression. The exception I'm getting is this: 15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal): java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOfRange(Arrays.java:2694) at java.lang.String.init(String.java:203) at com.esotericsoftware.kryo.io.Input.readString(Input.java:448) at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157) at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at
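For reference, DB Tsai's earlier suggestion of caching the training data compressed and serialized would look roughly like this (a sketch; the storage level and the spark.rdd.compress/Kryo settings are assumptions, and the explicit count() just materializes the cache once before LBFGS starts iterating):

```
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes spark.rdd.compress=true and, ideally, Kryo serialization in the SparkConf.
// MEMORY_ONLY_SER trades CPU for a much smaller cached footprint, which helps
// iterative algorithms such as LBFGS keep the whole training set in memory.
def cacheForTraining(training: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val cached = training.persist(StorageLevel.MEMORY_ONLY_SER)
  cached.count() // materialize up front so later passes hit the cache
  cached
}
```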
Re: Optimizing SQL Query
Dude, please, attach the execution plan of the query and details about the indexes.

2015-03-06 9:07 GMT-03:00 anu anamika.guo...@gmail.com:

I have a query like the one below. Could you help by providing pointers on how to start optimizing it w.r.t. Spark SQL?

sqlContext.sql("""
  SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
  FROM (
    SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
    FROM (
      SELECT *
      FROM date_d dd
      JOIN interval_f intf ON intf.DATE_WID = dd.WID
      WHERE intf.DATE_WID >= 20141116 AND intf.DATE_WID <= 20141125
        AND CAST(INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
        AND CAST(INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
        AND MEAS_WID = 3
    ) test
    JOIN sdp_d sdp ON test.SDP_WID = sdp.WID
    WHERE sdp.UDC_ID = 'SP-1931201848'
    GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
  ) dw
  GROUP BY dw.DAY_OF_WEEK, dw.HOUR""")

Currently the query takes 15 minutes of execution time, where the interval_f table holds approx 170 GB of raw data, date_d -- 170 MB, and sdp_d -- 490 MB.
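To get the execution plan being asked for, Spark SQL can print it itself; a minimal sketch (assuming a HiveContext named sqlContext and a variable query holding the SELECT above; EXPLAIN EXTENDED and the developer-API queryExecution field are the usual entry points in 1.2, though the exact output varies by version):

```
// query holds the SELECT statement shown above
val plan = sqlContext.sql("EXPLAIN EXTENDED " + query)
plan.collect().foreach(row => println(row.getString(0)))

// Or programmatically, via the developer API on the returned SchemaRDD:
println(sqlContext.sql(query).queryExecution.executedPlan)
```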
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
Do you have a reference paper to the implemented algorithm in TSQR.scala ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are couple of solvers that I've written that is part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are interested in porting them I'd be happy to review it Thanks Shivaram [1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala [2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark ? It seems that the only least square solver available in spark is private to recommender package. Cheers, Jao
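For readers of the subject line, the problem being solved is ridge-regularized least squares; restated (with $n$ the number of rows of $A$), the objective and the normal equations its minimizer satisfies are:

\[
\min_{x}\; \|Ax - b\|_2^2 + \lambda\, n\, \|x\|_2^2,
\qquad
\left(A^\top A + \lambda\, n\, I\right) x = A^\top b .
\]

The two ml-matrix solvers linked above attack this system in different ways, via a QR factorization (TSQR) and via the normal equations, respectively.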
spark-sorted, or secondary sort and streaming reduce for spark
currently spark provides many excellent algorithms for operations per key as long as the data send to the reducers per key fits in memory. operations like combineByKey, reduceByKey and foldByKey rely on pushing the operation map-side so that the data reduce-side is small. and groupByKey simply requires that the values per key fit in memory. but there are algorithms for which we would like to process all the values per key reduce-side, even when they do not fit in memory. examples are algorithms that need to process the values ordered, or algorithms that need to emit all values again. basically this is what the original hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. the library spark-sorted aims to bring these kind of operations back to spark, by providing a way to process values with a user provided Ordering[V] and a user provided streaming operation Iterator[V] = Iterator[W]. it does not make the assumption that the values need to fit in memory per key. the basic idea is to rely on spark's sort-based shuffle to re-arrange the data so that all values for a given key are placed consecutively within a single partition, and then process them using a map-like operation. you can find the project here: https://github.com/tresata/spark-sorted the project is in a very early stage. any feedback is very much appreciated.
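To make the idea concrete, here is a rough sketch of the trick the library builds on (not the spark-sorted API itself): partition on the key alone while letting the sort-based shuffle order the composite (key, value) pairs, then consume each key's consecutive, sorted run of values as a stream. The example keeps only the smallest value per key without ever buffering a key's values; the String/Int types are just for illustration.

```
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Partition on the key only, while the shuffle sorts on the full (key, value)
// composite, so each key's values end up consecutive and ascending in one partition.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  private val hash = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(composite: Any): Int =
    hash.getPartition(composite.asInstanceOf[(String, Int)]._1)
}

def smallestValuePerKey(rdd: RDD[(String, Int)], partitions: Int): RDD[(String, Int)] =
  rdd.map { case (k, v) => ((k, v), ()) }
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions))
    .mapPartitions { iter =>
      var lastKey: String = null
      iter.flatMap { case ((k, v), _) =>
        val firstOfKey = k != lastKey // values arrive sorted, so the first one is the minimum
        lastKey = k
        if (firstOfKey) Iterator((k, v)) else Iterator.empty
      }
    }
```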
Re: No overwrite flag for saveAsXXFile
Since we already have the spark.hadoop.validateOutputSpecs config, I think there is not much need to expose disableOutputSpecValidation.

Cheers

On Fri, Mar 6, 2015 at 7:34 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

Actually, besides setting spark.hadoop.validateOutputSpecs to false to disable output validation for the whole program, the Spark implementation uses a DynamicVariable (in object PairRDDFunctions) internally to disable it in a case-by-case manner:

val disableOutputSpecValidation: DynamicVariable[Boolean] = new DynamicVariable[Boolean](false)

I'm not sure there is enough benefit to make it worth exposing this variable to the user...

Best,

--
Nan Zhu
http://codingcat.me

On Friday, March 6, 2015 at 10:22 AM, Ted Yu wrote:

Found this thread: http://search-hadoop.com/m/JW1q5HMrge2 Cheers

On Fri, Mar 6, 2015 at 6:42 AM, Sean Owen so...@cloudera.com wrote:

This was discussed in the past and viewed as dangerous to enable. The biggest problem, by far, comes when you have a job that outputs M partitions, 'overwriting' a directory of data containing N > M old partitions. You suddenly have a mix of new and old data. It doesn't match Hadoop's semantics either, which won't let you do this. You can of course simply remove the output directory.

On Fri, Mar 6, 2015 at 2:20 PM, Ted Yu yuzhih...@gmail.com wrote:

Adding support for an overwrite flag would make saveAsXXFile more user friendly. Cheers

On Mar 6, 2015, at 2:14 AM, Jeff Zhang zjf...@gmail.com wrote:

Hi folks, I found that RDD:saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this?

--
Best Regards
Jeff Zhang
Re: Data Frame types
Hi Cesar, Yes, you can define an UDT with the new DataFrame, the same way that SchemaRDD did. Jaonary On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote: The SchemaRDD supports the storage of user defined classes. However, in order to do that, the user class needs to extends the UserDefinedType interface (see for example VectorUDT in org.apache.spark.mllib.linalg). My question is: Do the new Data Frame Structure (to be released in spark 1.3) will be able to handle user defined classes too? Do user classes will need to extend they will need to define the same approach? -- Cesar Flores
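As a rough sketch of what such a UDT looks like (modeled on VectorUDT; the Point class is made up, and the package names and exact method signatures are assumptions against the 1.3 developer API, so they may differ in your release):

```
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Annotation ties the user class to its UDT.
@SQLUserDefinedType(udt = classOf[PointUDT])
case class Point(x: Double, y: Double)

class PointUDT extends UserDefinedType[Point] {
  // How the value is laid out inside a DataFrame column.
  override def sqlType: DataType =
    StructType(Seq(StructField("x", DoubleType), StructField("y", DoubleType)))

  override def serialize(obj: Any): Any = obj match {
    case Point(x, y) => Row(x, y)
  }

  override def deserialize(datum: Any): Point = datum match {
    case r: Row => Point(r.getDouble(0), r.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
```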
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
Hi Shivaram, Thank you for the link. I'm trying to figure out how can I port this to mllib. May you can help me to understand how pieces fit together. Currently, in mllib there's different types of distributed matrix : BlockMatrix, CoordinateMatrix, IndexedRowMatrix and RowMatrix. Which one should correspond to RowPartitionedMatrix in ml-matrix ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are couple of solvers that I've written that is part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are interested in porting them I'd be happy to review it Thanks Shivaram [1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala [2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark ? It seems that the only least square solver available in spark is private to recommender package. Cheers, Jao
Re: Building Spark 1.3 for Scala 2.11 using Maven
-Pscala-2.11 and -Dscala-2.11 will happen to do the same thing for this profile. Why are you running install package and not just install? Probably doesn't matter. This sounds like you are trying to only build core without building everything else, which you can't do in general unless you already built and installed these snapshot artifacts locally. On Fri, Mar 6, 2015 at 12:46 AM, Night Wolf nightwolf...@gmail.com wrote: Hey guys, Trying to build Spark 1.3 for Scala 2.11. I'm running with the folllowng Maven command; -DskipTests -Dscala-2.11 clean install package Exception: [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dependencies for project org.apache.spark:spark-core_2.10:jar:1.3.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced - [Help 1] I see these warnings in the log before this error: [INFO] [INFO] [INFO] Building Spark Project Core 1.3.0-SNAPSHOT [INFO] [WARNING] The POM for org.apache.spark:spark-network-common_2.11:jar:1.3.0-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-shuffle_2.11:jar:1.3.0-SNAPSHOT is missing, no dependency information available Any ideas? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Optimizing SQL Query
I have a query like the one below. Could you help by providing pointers on how to start optimizing it w.r.t. Spark SQL?

sqlContext.sql("""
  SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE
  FROM (
    SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE
    FROM (
      SELECT *
      FROM date_d dd
      JOIN interval_f intf ON intf.DATE_WID = dd.WID
      WHERE intf.DATE_WID >= 20141116 AND intf.DATE_WID <= 20141125
        AND CAST(INTERVAL_END_TIME AS STRING) >= '2014-11-16 00:00:00.000'
        AND CAST(INTERVAL_END_TIME AS STRING) <= '2014-11-26 00:00:00.000'
        AND MEAS_WID = 3
    ) test
    JOIN sdp_d sdp ON test.SDP_WID = sdp.WID
    WHERE sdp.UDC_ID = 'SP-1931201848'
    GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID
  ) dw
  GROUP BY dw.DAY_OF_WEEK, dw.HOUR""")

Currently the query takes 15 minutes of execution time, where the interval_f table holds approx 170 GB of raw data, date_d -- 170 MB, and sdp_d -- 490 MB.
Re: spark-stream programme failed on yarn-client
Thanks ,you advise is usefull I just submit my job on my spark client which config with simple configure file so it failed when i run my job on service machine everything is okay On Fri, Mar 06, 2015 at 02:10:04PM +0530, Akhil Das wrote: Looks like an issue with your yarn setup, could you try doing a simple example with spark-shell? Start the spark shell as: $*MASTER=yarn-client bin/spark-shell* *spark-shell *sc.parallelize(1 to 1000).collect If that doesn't work, then make sure your yarn services are up and running and in your spark-env.sh you may set the corresponding configurations from the following: # Options read in YARN client mode # - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files # - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2) # - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1). # - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G) # - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb) Thanks Best Regards On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong 980548...@qq.com wrote: Hi all, I'm try to write a spark stream programme so i read the spark online document ,according the document i write the programe like this : import org.apache.spark.SparkConf import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object SparkStreamTest { def main(args: Array[String]) { val conf = new SparkConf() val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream(args(0), args(1).toInt) val words = lines.flatMap(_.split( )) val pairs = words.map(word = (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() // Start the computation ssc.awaitTermination() // Wait for the computation to terminate } } for test i first start listen a port by this: nc -lk and then i submit job by spark-submit --master local[2] --class com.nd.hxf.SparkStreamTest spark-test-tream-1.0-SNAPSHOT-job.jar localhost everything is okay but when i run it on yarn by this : spark-submit --master yarn-client --class com.nd.hxf.SparkStreamTest spark-test-tream-1.0-SNAPSHOT-job.jar localhost it wait for a longtime and repeat output somemessage a apart of the output is like this: 15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 3(ms) 15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started 15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1 15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1 15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1 15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1 15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1 15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms 15/03/06 15:30:24 INFO SocketInputDStream: Storage level = StorageLevel(false, false, false, false, 1) 15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null 15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms 15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8 15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms 15/03/06 15:30:24 INFO FlatMappedDStream: Storage level = StorageLevel(false, false, false, false, 1) 15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null 15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms 
15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453 15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms 15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1) 15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null 15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms 15/03/06 15:30:24 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@941451f 15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms 15/03/06 15:30:24 INFO ShuffledDStream: Storage level = StorageLevel(false, false, false, false, 1) 15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null 15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms 15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee 15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms 15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1) 15/03/06
No overwrite flag for saveAsXXFile
Hi folks, I found that RDD:saveXXFile has no overwrite flag, which I think would be very helpful. Is there any reason for this?

--
Best Regards
Jeff Zhang
Re: Spark-SQL and Hive - is Hive required?
Its not required, but even if you don't have hive installed you probably still want to use the HiveContext. From earlier in that doc: In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the ability to read data from Hive tables. To use a HiveContext, *you do not need to have an existing Hive setup*, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext. On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote: Hi Edmon, No, you do not need to install Hive to use Spark SQL. Thanks, Yin On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote: Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive Thank you, Edmon
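In code, the only difference is which context you construct on top of the SparkContext; a minimal sketch (the table name is made up, and sc is assumed to be an existing SparkContext, e.g. the one the shell provides):

```
import org.apache.spark.sql.hive.HiveContext

// Works without a Hive installation; a local metastore is created as needed.
val hiveContext = new HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val results = hiveContext.sql("SELECT key, value FROM src LIMIT 10")
results.collect().foreach(println)
```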
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
First, thanks to everyone for their assistance and recommendations. @Marcelo I applied the patch that you recommended and am now able to get into the shell, thank you worked great after I realized that the pom was pointing to the 1.3.0-SNAPSHOT for parent, need to bump that down to 1.2.1. @Zhan Need to apply this patch next. I tried to start the spark-thriftserver but and it starts, then fails with like this: I have the entries in my spark-default.conf, but not the patch applied. ./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m --hiveconf hive.server2.thrift.port=10001 5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:404015/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:805015/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar - hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_001815/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)15/03/06 12:34:22 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1425663261755 final status: UNDEFINED tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/ user: root15/03/06 12:34:23 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)15/03/06 12:34:24 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)15/03/06 12:34:25 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)15/03/06 12:34:26 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED)15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763]15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI 
Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS - hadoopdev02.opsdatastore.com, PROXY_URI_BASES - http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018), /proxy/application_1425078697953_001815/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter15/03/06 12:34:27 INFO yarn.Client: Application report for application_1425078697953_0018 (state: RUNNING)15/03/06 12:34:27 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: hadoopdev08.opsdatastore.com ApplicationMaster RPC port: 0 queue: default start time: 1425663261755 final status: UNDEFINED tracking URL: http://hadoopdev02.opsdatastore.com:8088/proxy/application_1425078697953_0018/ user: root15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Application application_1425078697953_0018 has started running.15/03/06 12:34:28 INFO netty.NettyBlockTransferService: Server created on 4612415/03/06 12:34:28 INFO storage.BlockManagerMaster: Trying to register BlockManager15/03/06 12:34:28 INFO storage.BlockManagerMasterActor: Registering block manager hadoopdev01.opsdatastore.com:46124 with 265.4 MB RAM, BlockManagerId(driver, hadoopdev01.opsdatastore.com, 46124)15/03/06 12:34:28 INFO storage.BlockManagerMaster: Registered
SparkSQL supports hive insert overwrite directory?
Hello,

I am using Spark 1.2.1 along with Hive 0.13.1. I run some Hive queries by using beeline and the Thriftserver. The queries I have tested so far worked well, except the following: I want to export the query output into a file on either HDFS or the local fs (ideally the local fs). Are these not yet supported? The Spark GitHub repo already has unit tests using "insert overwrite directory": https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/ql/src/test/queries/clientpositive/insert_overwrite_local_directory_1.q

$ insert overwrite directory 'hdfs directory name' select * from temptable;

TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME temptable TOK_INSERT TOK_DESTINATION TOK_DIR '/user/ogoh/table' TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for: TOK_DESTINATION TOK_DIR '/user/bob/table'

$ insert overwrite local directory 'hdfs directory name' select * from temptable;

TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME temptable TOK_INSERT TOK_DESTINATION TOK_LOCAL_DIR /user/bob/table TOK_SELECT TOK_SELEXPR TOK_ALLCOLREF

scala.NotImplementedError: No parse rules for: TOK_DESTINATION TOK_LOCAL_DIR /user/ogoh/table

Thanks,
Okehee
Help with transformWith in SparkStreaming
Hi,

I am filtering the first DStream with the value in the second DStream. I also want to keep the value of the second DStream. I have done the following and am having a problem with returning the new RDD:

val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String,String)], rdd2: RDD[Int]) => {
  var first = ""; var second = ""; var third = 0
  if (rdd2.first >= 3) {
    first = rdd1.map(_._1).first
    second = rdd1.map(_._2).first
    third = rdd2.first
  }
  RDD[(first,second,third)]
})

ERROR /home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119: error: not found: value RDD
[ERROR] RDD[(first,second,third)]

I have imported org.apache.spark.rdd.RDD.

Regards,
Laeeq
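The transform function has to return an actual RDD value, not the type expression RDD[...]. One way to build it from the two input RDDs, as a sketch reusing the fileAndTime and anomaly DStreams above (and assuming the intent is to tag every (file, time) pair with the anomaly value whenever it is at least 3, with empty batches producing an empty result):

```
import org.apache.spark.rdd.RDD

val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    val head = rdd2.take(1) // avoids first() failing on an empty batch
    if (head.nonEmpty && head(0) >= 3) {
      val third = head(0)
      rdd1.map { case (first, second) => (first, second, third) }
    } else {
      rdd1.sparkContext.parallelize(Seq.empty[(String, String, Int)])
    }
  })
```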
Re: takeSample triggers 2 jobs
Hi Rares,

If you dig into the descriptions for the two jobs, it will probably return something like:

Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

Job ID: 0
org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...

The code for Spark is in the git copy of master at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

Basically, line 428 refers to

val initialCount = this.count()

and line 447 refers to

var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

Basically, the first job is getting the count so you can do the second job, which is to generate the samples.

HTH!
Denny

On Fri, Mar 6, 2015 at 10:44 AM Rares Vernica rvern...@gmail.com wrote:

Hello, I am using takeSample from the Scala Spark 1.2.1 shell:

scala> sc.textFile("README.md").takeSample(false, 3)

and I notice that two jobs are generated on the Spark Jobs page:

Job Id  Description
1       takeSample at console:13
0       takeSample at console:13

Any ideas why the two jobs are needed? Thanks! Rares
Re: Visualize Spark Job
I have this PR submitted. You can merge it and try: https://github.com/apache/spark/pull/2077

On Thu, Jan 15, 2015 at 12:50 AM, Kuromatsu, Nobuyuki n.kuroma...@jp.fujitsu.com wrote:

Hi, I want to visualize tasks and stages in order to analyze Spark jobs. I know the necessary metrics are written in spark.eventLog.dir. Does anyone know a tool like the swimlanes in Tez?

Regards,
Nobuyuki Kuromatsu

--
Phuoc Do
https://vida.io/dnprock
Re: [SPARK-SQL] How to pass parameter when running hql script using cli?
Do you mean "--hiveconf" (two dashes) instead of -hiveconf (one dash)?

Thanks.

Zhan Zhang

On Mar 6, 2015, at 4:20 AM, James alcaid1...@gmail.com wrote:

Hello, I want to execute a hql script through the `spark-sql` command. My script contains:

``` ALTER TABLE xxx DROP PARTITION (date_key = ${hiveconf:CUR_DATE}); ```

When I execute

``` spark-sql -f script.hql -hiveconf CUR_DATE=20150119 ```

it throws an error like

``` cannot recognize input near '$' '{' 'hiveconf' in constant ```

I have tried this on Hive and it works. So how can I pass a parameter such as a date to a hql script?

Alcaid
takeSample triggers 2 jobs
Hello,

I am using takeSample from the Scala Spark 1.2.1 shell:

scala> sc.textFile("README.md").takeSample(false, 3)

and I notice that two jobs are generated on the Spark Jobs page:

Job Id  Description
1       takeSample at console:13
0       takeSample at console:13

Any ideas why the two jobs are needed?

Thanks! Rares
Re: Spark-SQL and Hive - is Hive required?
Hi,

For creating a Hive table, do I need to add hive-site.xml to the spark/conf directory?

On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com wrote:

Its not required, but even if you don't have hive installed you probably still want to use the HiveContext. From earlier in that doc: In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, *you do not need to have an existing Hive setup*, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext.

On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

Hi Edmon, No, you do not need to install Hive to use Spark SQL. Thanks, Yin

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

Does Spark-SQL require installation of Hive for it to run correctly or not? I could not tell from this statement: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive Thank you, Edmon
Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?
Section 3, 4, 5 in http://www.netlib.org/lapack/lawnspdf/lawn204.pdf is a good reference Shivaram On Mar 6, 2015 9:17 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Do you have a reference paper to the implemented algorithm in TSQR.scala ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are couple of solvers that I've written that is part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are interested in porting them I'd be happy to review it Thanks Shivaram [1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala [2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark ? It seems that the only least square solver available in spark is private to recommender package. Cheers, Jao
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
Hi Todd, Looks like the thrift server can connect to metastore, but something wrong in the executors. You can try to get the log with yarn logs -applicationID xxx” to check why it failed. If there is no log (master or executor is not started at all), you can go to the RM webpage, click the link to see why the shell failed in the first place. Thanks. Zhan Zhang On Mar 6, 2015, at 9:59 AM, Todd Nist tsind...@gmail.commailto:tsind...@gmail.com wrote: First, thanks to everyone for their assistance and recommendations. @Marcelo I applied the patch that you recommended and am now able to get into the shell, thank you worked great after I realized that the pom was pointing to the 1.3.0-SNAPSHOT for parent, need to bump that down to 1.2.1. @Zhan Need to apply this patch next. I tried to start the spark-thriftserver but and it starts, then fails with like this: I have the entries in my spark-default.conf, but not the patch applied. ./sbin/start-thriftserver.sh --master yarn --executor-memory 1024m --hiveconf hive.server2.thrift.port=10001 5/03/06 12:34:17 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01http://hadoopdev01/.opsdatastore.com:4040 15/03/06 12:34:18 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02http://hadoopdev02/.opsdatastore.com:8188/ws/v1/timeline/ 15/03/06 12:34:18 INFO client.RMProxy: Connecting to ResourceManager at hadoopdev02.opsdatastore.com/192.168.15.154:8050 15/03/06 12:34:18 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers 15/03/06 12:34:18 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 15/03/06 12:34:18 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 15/03/06 12:34:18 INFO yarn.Client: Setting up container launch context for our AM 15/03/06 12:34:18 INFO yarn.Client: Preparing resources for our AM container 15/03/06 12:34:19 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
15/03/06 12:34:19 INFO yarn.Client: Uploading resource file:/root/spark-1.2.1-bin-hadoop2.6/lib/spark-assembly-1.2.1-hadoop2.6.0.jar - hdfs://hadoopdev01.opsdatastore.com:8020/user/root/.sparkStaging/application_1425078697953_0018/spark-assembly-1.2.1-hadoop2.6.0.jar 15/03/06 12:34:21 INFO yarn.Client: Setting up the launch environment for our AM container 15/03/06 12:34:21 INFO spark.SecurityManager: Changing view acls to: root 15/03/06 12:34:21 INFO spark.SecurityManager: Changing modify acls to: root 15/03/06 12:34:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 15/03/06 12:34:21 INFO yarn.Client: Submitting application 18 to ResourceManager 15/03/06 12:34:21 INFO impl.YarnClientImpl: Submitted application application_1425078697953_0018 15/03/06 12:34:22 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED) 15/03/06 12:34:22 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1425663261755 final status: UNDEFINED tracking URL: http://hadoopdev02http://hadoopdev02/.opsdatastore.com:8088/proxy/application_1425078697953_0018/ user: root 15/03/06 12:34:23 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED) 15/03/06 12:34:24 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED) 15/03/06 12:34:25 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED) 15/03/06 12:34:26 INFO yarn.Client: Application report for application_1425078697953_0018 (state: ACCEPTED) 15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkyar...@hadoopdev08.opsdatastore.com:40201/user/YarnAM#-557112763] 15/03/06 12:34:27 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS - hadoopdev02.opsdatastore.com, PROXY_URI_BASES - http://hadoopdev02http://hadoopdev02/.opsdatastore.com:8088/proxy/application_1425078697953_0018), /proxy/application_1425078697953_0018 15/03/06 12:34:27 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 15/03/06 12:34:27 INFO yarn.Client: Application report for application_1425078697953_0018 (state: RUNNING) 15/03/06 12:34:27 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: hadoopdev08.opsdatastore.com ApplicationMaster RPC port: 0 queue: default start time: 1425663261755 final status: UNDEFINED tracking URL:
Re: Spark-SQL and Hive - is Hive required?
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote:

Can I get a document on how to create that setup? I mean, I need Hive integration on Spark.

http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
Hi Zhan,

I applied the patch you recommended, https://github.com/apache/spark/pull/3409, and it now works. It was failing with this:

Exception message: /hadoop/yarn/local/usercache/root/appcache/application_1425078697953_0020/container_1425078697953_0020_01_02/launch_container.sh: line 14: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*: bad substitution

while spark-defaults.conf has these defined:

spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041

Without the patch, ${hdp.version} was not being substituted. Thanks for pointing me to that patch, appreciate it.

-Todd
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
Sorry, my misunderstanding. Looks like it already worked. If you still hit an hdp.version problem, you can try it :)

Thanks.
Zhan Zhang

On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote:

You are using 1.2.1, right? If so, please add a java-opts file in the conf directory and give it a try.

[root@c6401 conf]# more java-opts
-Dhdp.version=2.2.2.0-2041

Thanks.
Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

-Dhdp.version=2.2.0.0-2041
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
You are using 1.2.1, right? If so, please add a java-opts file in the conf directory and give it a try.

[root@c6401 conf]# more java-opts
-Dhdp.version=2.2.2.0-2041

Thanks.
Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

-Dhdp.version=2.2.0.0-2041
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
Working great now, after applying that patch; thanks again.

On Fri, Mar 6, 2015 at 2:42 PM, Zhan Zhang zzh...@hortonworks.com wrote:

Sorry, my misunderstanding. Looks like it already worked. If you still hit an hdp.version problem, you can try it :)

Thanks.
Zhan Zhang

On Mar 6, 2015, at 11:40 AM, Zhan Zhang zzh...@hortonworks.com wrote:

You are using 1.2.1, right? If so, please add a java-opts file in the conf directory and give it a try.

[root@c6401 conf]# more java-opts
-Dhdp.version=2.2.2.0-2041

Thanks.
Zhan Zhang

On Mar 6, 2015, at 11:35 AM, Todd Nist tsind...@gmail.com wrote:

-Dhdp.version=2.2.0.0-2041
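To recap the fix from this thread in one place, the configuration ends up looking like this (the version strings are the ones quoted in the messages above; substitute your own HDP build number):

  # conf/spark-defaults.conf
  spark.driver.extraJavaOptions   -Dhdp.version=2.2.0.0-2041
  spark.yarn.am.extraJavaOptions  -Dhdp.version=2.2.0.0-2041

  # conf/java-opts (single line)
  -Dhdp.version=2.2.0.0-2041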
Re: spark-sorted, or secondary sort and streaming reduce for spark
Hi Koert,

Would you like to register this on spark-packages.org?

Burak

On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:

Currently Spark provides many excellent algorithms for operations per key, as long as the data sent to the reducers per key fits in memory. Operations like combineByKey, reduceByKey and foldByKey rely on pushing the operation map-side so that the data reduce-side is small, and groupByKey simply requires that the values per key fit in memory.

But there are algorithms for which we would like to process all the values per key reduce-side, even when they do not fit in memory. Examples are algorithms that need to process the values in order, or algorithms that need to emit all values again. Basically this is what the original Hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion.

The library spark-sorted aims to bring these kinds of operations back to Spark, by providing a way to process values with a user-provided Ordering[V] and a user-provided streaming operation Iterator[V] => Iterator[W]. It does not assume that the values per key fit in memory. The basic idea is to rely on Spark's sort-based shuffle to re-arrange the data so that all values for a given key are placed consecutively within a single partition, and then process them using a map-like operation.

You can find the project here: https://github.com/tresata/spark-sorted

The project is in a very early stage. Any feedback is very much appreciated.
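To make the idea concrete, here is a hypothetical usage sketch in the spirit of the description above. The import and the groupSort / mapStreamByKey names are assumptions, not copied from the project README; see the GitHub link for the real API.

  // import com.tresata.spark.sorted.PairRDDFunctions._   // assumed import, check the README
  val events = sc.parallelize(Seq(("a", 3L), ("a", 1L), ("b", 2L)))  // RDD[(String, Long)]
  val deltas = events
    .groupSort(Ordering[Long])                // values for each key arrive sorted reduce-side
    .mapStreamByKey { it: Iterator[Long] =>   // streaming Iterator[V] => Iterator[W], no in-memory grouping
      var prev = 0L
      it.map { v => val d = v - prev; prev = v; d }
    }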
Re: Spark-SQL and Hive - is Hive required?
Only if you want to configure the connection to an existing Hive metastore.

On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote:

Hi, for creating a Hive table do I need to add hive-site.xml in the spark/conf directory?

On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust mich...@databricks.com wrote:

It's not required, but even if you don't have Hive installed you probably still want to use the HiveContext. From earlier in that doc:

In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, *you do not need to have an existing Hive setup*, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext.

On Fri, Mar 6, 2015 at 7:22 AM, Yin Huai yh...@databricks.com wrote:

Hi Edmon, no, you do not need to install Hive to use Spark SQL.

Thanks,
Yin

On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli ebeg...@gmail.com wrote:

Does Spark SQL require installation of Hive for it to run correctly or not? I could not tell from this statement: https://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

Thank you,
Edmon
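For anyone following along, a minimal sketch of what Michael describes, for Spark 1.2; no existing Hive installation is assumed, and the table name and query here are made up:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("HiveContextExample"))
  val hiveContext = new HiveContext(sc)   // works even without an existing Hive setup

  // HiveQL parsing and Hive UDFs are available; SQLContext data sources still work too
  hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
  hiveContext.sql("SELECT count(*) FROM src").collect().foreach(println)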
Re: Data Frame types
No, the UDT API is not a public API, as we have not stabilized the implementation. For this reason it is only accessible to projects inside of Spark.

On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Hi Cesar, yes, you can define a UDT with the new DataFrame, the same way that SchemaRDD did.

Jaonary

On Fri, Mar 6, 2015 at 4:22 PM, Cesar Flores ces...@gmail.com wrote:

The SchemaRDD supports the storage of user-defined classes. However, in order to do that, the user class needs to extend the UserDefinedType interface (see for example VectorUDT in org.apache.spark.mllib.linalg). My question is: will the new DataFrame structure (to be released in Spark 1.3) be able to handle user-defined classes too? Will user classes need to extend the same interface and follow the same approach?

--
Cesar Flores
Re: Spark-SQL and Hive - is Hive required?
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote:

Yes, I want to link with an existing Hive metastore. Is that the right way to link to the Hive metastore?

Yes.
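A minimal sketch of that linkage: a hive-site.xml placed in spark/conf that points at the existing metastore service (the thrift host and port below are placeholders for your own metastore):

  <configuration>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://your-metastore-host:9083</value>
    </property>
  </configuration>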
Re: Help with transformWith in SparkStreaming
Yes, this is the problem. I want to return an RDD, but it is abstract and I cannot instantiate it. So what are the other options? I have two streams and I want to filter this stream on the basis of the other, and also keep the value of the other stream. I have also tried join, but one stream has more values than the other in each sliding window, and after the join I get repetitions which I don't want.

Regards,
Laeeq

On Friday, March 6, 2015 8:11 PM, Sean Owen so...@cloudera.com wrote:

What is this line supposed to mean? RDD[(first,second,third)] It's not valid as a line of code, and you don't instantiate RDDs anyway.

On Fri, Mar 6, 2015 at 7:06 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote:

Hi,

I am filtering the first DStream with the value in the second DStream. I also want to keep the value of the second DStream. I have done the following and am having a problem with returning the new RDD:

val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String,String)], rdd2: RDD[Int]) => {
  var first = ""; var second = ""; var third = 0
  if (rdd2.first >= 3) {
    first = rdd1.map(_._1).first
    second = rdd1.map(_._2).first
    third = rdd2.first
  }
  RDD[(first,second,third)]
})

ERROR
/home/hduser/Projects/scalaad/src/main/scala/eeg/anomd/StreamAnomalyDetector.scala:119: error: not found: value RDD
[ERROR] RDD[(first,second,third)]

I have imported org.apache.spark.rdd.RDD

Regards,
Laeeq
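One way around the error Sean points out, sketched from Laeeq's snippet: build the returned RDD from rdd1 instead of trying to instantiate an RDD directly. The stream names and the >= 3 condition are taken from the post above; this is an illustrative fix under those assumptions, not the only option.

  import org.apache.spark.rdd.RDD

  val transformedFileAndTime = fileAndTime.transformWith(anomaly,
    (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
      val head = rdd2.take(1)                                   // safe even if the window is empty
      if (head.nonEmpty && head(0) >= 3)
        rdd1.map { case (file, time) => (file, time, head(0)) } // keep the value from the 2nd stream
      else
        rdd1.sparkContext.parallelize(Seq.empty[(String, String, Int)])
    })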
Re: Spark Streaming Switchover Time
It is probably the time taken by the system to figure out that the worker is down. Could you check the logs to find what goes on when you kill the worker?

TD

On Wed, Mar 4, 2015 at 6:20 AM, Nastooh Avessta (navesta) nave...@cisco.com wrote:

Indeed. And I am wondering if this switchover time can be decreased.

Cheers,
Nastooh

From: Tathagata Das [mailto:t...@databricks.com]
Sent: Tuesday, March 03, 2015 11:11 PM
Subject: Re: Spark Streaming Switchover Time

I am confused. Are you killing the 1st worker node to see whether the system restarts the receiver on the second worker?

TD

On Tue, Mar 3, 2015 at 10:49 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote:

This is the time that it takes for the driver to start receiving data once again, from the 2nd worker, when the 1st worker, where the streaming thread was initially running, is shut down.

Cheers,
Nastooh

From: Tathagata Das [mailto:t...@databricks.com]
Sent: Tuesday, March 03, 2015 10:24 PM
Subject: Re: Spark Streaming Switchover Time

Can you elaborate on what this switchover time is?

TD

On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com wrote:

Hi,

On a standalone Spark 1.0.0 cluster, with 1 master and 2 workers, operating in client mode and running a UDP streaming application, I am noting around 2 seconds of elapsed time on switchover, upon shutting down the streaming worker, where the streaming window length is 1 sec. I am wondering what parameters are available to the developer to shorten this switchover time.

Cheers,
Nastooh
Re: Compile Spark with Maven Zinc Scala Plugin
Tried with that. No luck. Same error on the abt-interface jar. I can see Maven downloaded that jar into my .m2 cache.

On Friday, March 6, 2015, 鹰 980548...@qq.com wrote:

try it with mvn -DskipTests -Pscala-2.11 clean install package
Store the shuffled files in memory using Tachyon
Hi all,

Is it possible to store Spark shuffled files on a distributed memory like Tachyon instead of spilling them to disk?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Store-the-shuffled-files-in-memory-using-Tachyon-tp21944.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark-stream programme failed on yarn-client
Looks like an issue with your YARN setup. Could you try a simple example with spark-shell? Start the Spark shell as:

$ MASTER=yarn-client bin/spark-shell

and, inside the shell:

sc.parallelize(1 to 1000).collect

If that doesn't work, then make sure your YARN services are up and running, and in your spark-env.sh you may set the corresponding configurations from the following:

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)

Thanks
Best Regards

On Fri, Mar 6, 2015 at 1:09 PM, fenghaixiong 980548...@qq.com wrote:

Hi all,

I'm trying to write a Spark Streaming program, so I read the Spark online documentation. Following the documentation, I wrote the program like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object SparkStreamTest {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}

For a test, I first start listening on a port with:

nc -lk

and then submit the job with:

spark-submit --master local[2] --class com.nd.hxf.SparkStreamTest spark-test-tream-1.0-SNAPSHOT-job.jar localhost

Everything is okay. But when I run it on YARN with:

spark-submit --master yarn-client --class com.nd.hxf.SparkStreamTest spark-test-tream-1.0-SNAPSHOT-job.jar localhost

it waits for a long time and repeatedly outputs some messages. A part of the output is like this:

15/03/06 15:30:24 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 3(ms)
15/03/06 15:30:24 INFO ReceiverTracker: ReceiverTracker started
15/03/06 15:30:24 INFO ForEachDStream: metadataCleanupDelay = -1
15/03/06 15:30:24 INFO ShuffledDStream: metadataCleanupDelay = -1
15/03/06 15:30:24 INFO MappedDStream: metadataCleanupDelay = -1
15/03/06 15:30:24 INFO FlatMappedDStream: metadataCleanupDelay = -1
15/03/06 15:30:24 INFO SocketInputDStream: metadataCleanupDelay = -1
15/03/06 15:30:24 INFO SocketInputDStream: Slide time = 1000 ms
15/03/06 15:30:24 INFO SocketInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/06 15:30:24 INFO SocketInputDStream: Checkpoint interval = null
15/03/06 15:30:24 INFO SocketInputDStream: Remember duration = 1000 ms
15/03/06 15:30:24 INFO SocketInputDStream: Initialized and validated org.apache.spark.streaming.dstream.SocketInputDStream@b01c5f8
15/03/06 15:30:24 INFO FlatMappedDStream: Slide time = 1000 ms
15/03/06 15:30:24 INFO FlatMappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/06 15:30:24 INFO FlatMappedDStream: Checkpoint interval = null
15/03/06 15:30:24 INFO FlatMappedDStream: Remember duration = 1000 ms
15/03/06 15:30:24 INFO FlatMappedDStream: Initialized and validated org.apache.spark.streaming.dstream.FlatMappedDStream@6bd47453
15/03/06 15:30:24 INFO MappedDStream: Slide time = 1000 ms
15/03/06 15:30:24 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/06 15:30:24 INFO MappedDStream: Checkpoint interval = null
15/03/06 15:30:24 INFO MappedDStream: Remember duration = 1000 ms
15/03/06 15:30:24 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@941451f
15/03/06 15:30:24 INFO ShuffledDStream: Slide time = 1000 ms
15/03/06 15:30:24 INFO ShuffledDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/06 15:30:24 INFO ShuffledDStream: Checkpoint interval = null
15/03/06 15:30:24 INFO ShuffledDStream: Remember duration = 1000 ms
15/03/06 15:30:24 INFO ShuffledDStream: Initialized and validated org.apache.spark.streaming.dstream.ShuffledDStream@42eba6ee
15/03/06 15:30:24 INFO ForEachDStream: Slide time = 1000 ms
15/03/06 15:30:24 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/06 15:30:24 INFO ForEachDStream: Checkpoint interval = null
15/03/06 15:30:24 INFO ForEachDStream: Remember duration = 1000 ms
15/03/06 15:30:24 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@48d166b5
15/03/06 15:30:24 INFO SparkContext: Starting job: start at SparkStreamTest.scala:21
15/03/06 15:30:24 INFO
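For completeness, a spark-env.sh sketch covering the YARN client-mode options listed above; every value here is an assumption, so adjust the paths and sizes for your cluster:

  export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumed location of the Hadoop configuration files
  export SPARK_EXECUTOR_INSTANCES=2
  export SPARK_EXECUTOR_CORES=1
  export SPARK_EXECUTOR_MEMORY=1G
  export SPARK_DRIVER_MEMORY=512M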