Re: Where is Redgate's HDFS explorer?
You can also mount HDFS through the NFS gateway and access it that way, I think. Thanks, Best Regards

On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote: http://hortonworks.com/blog/windows-explorer-experience-hdfs/ seemed to exist, but now there is no sign of it. Is there anything similar to tie HDFS into Windows Explorer? Thanks,
Re: History server is not receiving any event
Are you starting your history server? ./sbin/start-history-server.sh You can read more here: http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact Thanks, Best Regards

On Tue, Aug 25, 2015 at 1:07 AM, b.bhavesh b.borisan...@gmail.com wrote: Hi, I am working on a streaming application. I tried to configure the history server to persist the events of the application in the Hadoop file system (HDFS). However, it is not logging any events. I am running Apache Spark 1.4.1 (pyspark) under Ubuntu 14.04 with three nodes. Here is my configuration:

File /usr/local/spark/conf/spark-defaults.conf (on all three nodes):
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master-host:port/usr/local/hadoop/spark_log

On the master node:
export SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://host:port/usr/local/hadoop/spark_log

Can someone give a list of steps to configure the history server? Thanks and regards, b.bhavesh
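For reference, a minimal sketch of the pieces that usually have to line up (the host, port and path are placeholders, not taken from the thread): the event log directory must already exist in HDFS, the exact same URI must be used for spark.eventLog.dir and spark.history.fs.logDirectory, and the history server must be restarted after changing it.

# spark-defaults.conf on every node that submits applications
spark.eventLog.enabled    true
spark.eventLog.dir        hdfs://namenode-host:8020/shared/spark_log

# spark-env.sh on the node that runs the history server
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode-host:8020/shared/spark_log"

# create the directory, then (re)start the history server
hdfs dfs -mkdir -p /shared/spark_log
./sbin/start-history-server.sh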
Re: Where is Redgate's HDFS explorer?
See https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html FYI

On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can also mount HDFS through the NFS gateway and access it that way, I think. Thanks, Best Regards

On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote: http://hortonworks.com/blog/windows-explorer-experience-hdfs/ seemed to exist, but now there is no sign of it. Is there anything similar to tie HDFS into Windows Explorer? Thanks,
Re: Array Out of Bound Exception
I suspect that in the last scenario you have an empty new line as the last line. If you put in a try..catch you'd know for sure. Thanks, Best Regards

On Tue, Aug 25, 2015 at 2:53 AM, Michael Armbrust mich...@databricks.com wrote: This top line here is indicating that the exception is being thrown from your code (i.e. code written in the console): at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40) Check to make sure that you are properly handling data that has fewer columns than you would expect.

On Mon, Aug 24, 2015 at 12:41 PM, SAHA, DEBOBROTA ds3...@att.com wrote: Hi, I am using Spark 1.4 and I am getting an array out of bounds exception when I try to read from a registered table in Spark. For example, if I have 3 different text files with the content below:

Scenario 1:
A1|B1|C1
A2|B2|C2

Scenario 2:
A1| |C1
A2| |C2

Scenario 3:
A1| B1|
A2| B2|

So for Scenarios 1 and 2 it works fine, but for Scenario 3 I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Please help. Thanks, Debobrota
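For reference, Java/Scala String.split drops trailing empty fields unless a negative limit is passed, which is why rows like "A1| B1|" come back with only two fields and index 2 blows up. A minimal sketch (the column handling is illustrative only):

val line = "A1| B1|"
// default split: trailing empty fields are dropped -> Array("A1", " B1"), length 2
val broken = line.split("\\|")
// limit of -1 keeps trailing empty fields -> Array("A1", " B1", ""), length 3
val fields = line.split("\\|", -1)
// still guard against genuinely short rows before indexing
val third = if (fields.length > 2) fields(2) else ""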
Re: Where is Redgate's HDFS explorer?
If HDFS is on a Linux VM, you could also mount it with FUSE and export it with Samba.

2015-08-29 2:26 GMT-07:00 Ted Yu yuzhih...@gmail.com: See https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html FYI

On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can also mount HDFS through the NFS gateway and access it that way, I think. Thanks, Best Regards

On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote: http://hortonworks.com/blog/windows-explorer-experience-hdfs/ seemed to exist, but now there is no sign of it. Is there anything similar to tie HDFS into Windows Explorer? Thanks,
Re: Where is Redgate's HDFS explorer?
It depends: if HDFS is running under Windows, FUSE won't work, but if HDFS is on a Linux VM, box, or cluster, then you can have the Linux box/VM mount HDFS through FUSE and at the same time export its mount point over Samba. At that point, your Windows machine can just connect to the Samba share. R.

2015-08-29 4:04 GMT-07:00 Dino Fancellu d...@felstar.com: I'm using Windows. Are you saying it works with Windows? Dino.

On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote: You can also mount HDFS through the NFS gateway and access it that way, I think. Thanks, Best Regards

On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote: http://hortonworks.com/blog/windows-explorer-experience-hdfs/ seemed to exist, but now there is no sign of it. Is there anything similar to tie HDFS into Windows Explorer? Thanks,
Re: Where is Redgate's HDFS explorer?
I'm using Windows. Are you saying it works with Windows? Dino.

On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote: You can also mount HDFS through the NFS gateway and access it that way, I think. Thanks, Best Regards

On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote: http://hortonworks.com/blog/windows-explorer-experience-hdfs/ seemed to exist, but now there is no sign of it. Is there anything similar to tie HDFS into Windows Explorer? Thanks,
Apache Spark Suitable JDBC Driver not found
I am using Apache Spark for analyzing query logs. I already faced some difficulties setting up Spark; now I am using a standalone cluster to process queries. First I used the example code in Java to count words, and that worked fine. But when I try to connect it to a MySQL server, a problem arises. I am using Ubuntu 14.04 LTS 64 bit, Spark 1.4.1 and MySQL 5.1. This is my code; when I use the master URL instead of local[*] I get the error "No suitable driver found". I have included the log. Here is the Stack Overflow link with the details of the problem: http://stackoverflow.com/questions/32280276/apache-spark-mysql-connection-suitable-jdbc-driver-not-found?noredirect=1#comment52445586_32280276
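For what it's worth, the usual fixes for "No suitable driver" on a cluster (as opposed to local mode) are to ship the MySQL connector jar to both the driver and the executors and to name the driver class explicitly when reading. A minimal sketch, based on my reading of the 1.4 JDBC data source options — the jar path, host, table and credentials below are placeholders, not taken from the thread:

// submit with the connector on both classpaths, e.g.:
//   spark-submit --jars /path/to/mysql-connector-java-5.1.36.jar \
//                --driver-class-path /path/to/mysql-connector-java-5.1.36.jar ...
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://db-host:3306/logs?user=USER&password=PASS",
  "dbtable" -> "query_log",
  "driver"  -> "com.mysql.jdbc.Driver"   // naming the class avoids the DriverManager lookup failing on executors
)).load()
df.show()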
Re: Array Out of Bound Exception
So either you have an empty line at the end, or when you use String.split you don't specify -1 as the second parameter...

On Aug 29, 2015 1:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I suspect that in the last scenario you have an empty new line as the last line. If you put in a try..catch you'd know for sure. Thanks, Best Regards

On Tue, Aug 25, 2015 at 2:53 AM, Michael Armbrust mich...@databricks.com wrote: This top line here is indicating that the exception is being thrown from your code (i.e. code written in the console): at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40) Check to make sure that you are properly handling data that has fewer columns than you would expect.

On Mon, Aug 24, 2015 at 12:41 PM, SAHA, DEBOBROTA ds3...@att.com wrote: Hi, I am using Spark 1.4 and I am getting an array out of bounds exception when I try to read from a registered table in Spark. For example, if I have 3 different text files with the content below: Scenario 1: A1|B1|C1 A2|B2|C2 Scenario 2: A1| |C1 A2| |C2 Scenario 3: A1| B1| A2| B2| So for Scenarios 1 and 2 it works fine, but for Scenario 3 I get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException: 2 ... Please help. Thanks, Debobrota
Re: How to set environment of worker applications
Finally, I found the solution: on the Spark context you can set spark.executorEnv.[EnvironmentVariableName], and these will be available in the environment of the executors. This is in fact documented, but somehow I missed it. https://spark.apache.org/docs/latest/configuration.html#runtime-environment Thanks for all the help though. Jan

On 25 Aug 2015, at 08:57, Hemant Bhanawat hemant9...@gmail.com wrote: Ok, I went in the direction of system vars from the beginning, probably because the question was about passing variables to a particular job. Anyway, the decision to use either system vars or environment vars would solely depend on whether you want to make them available to all the Spark processes on a node or to a particular job. Are there any other reasons why one would prefer one over the other?

On Mon, Aug 24, 2015 at 8:48 PM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: System properties and environment variables are two different things. One can use spark.executor.extraJavaOptions to pass system properties and spark-env.sh to pass environment variables. -raghav

On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat hemant9...@gmail.com wrote: That's surprising. Passing the environment variables using spark.executor.extraJavaOptions=-Dmyenvvar=xxx to the executor and then fetching them using System.getProperty("myenvvar") has worked for me. What is the error that you guys got?

On Mon, Aug 24, 2015 at 12:10 AM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: spark-env.sh works for me in Spark 1.4 but not spark.executor.extraJavaOptions.

On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I think the only way to pass environment variables to the worker nodes is to write them in the spark-env.sh file on each worker node.

On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com wrote: Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the following article. I think you can use -D to pass system vars: spark.apache.org/docs/latest/configuration.html#runtime-environment

Hi, I am starting a Spark Streaming job in standalone mode with spark-submit. Is there a way to make the UNIX environment variables with which spark-submit is started available to the processes started on the worker nodes? Jan
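A minimal sketch of the approach Jan settled on (the variable name and value are illustrative only): the driver forwards a chosen variable into every executor's environment via spark.executorEnv.*, and tasks can then read it back as an ordinary environment variable.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyStreamingJob")
  // forwarded into each executor's environment on the worker nodes
  .set("spark.executorEnv.MY_ENV_VAR", sys.env.getOrElse("MY_ENV_VAR", ""))
val sc = new SparkContext(conf)

// inside a task, the variable is visible like any other environment variable
sc.parallelize(1 to 2).map(_ => sys.env.get("MY_ENV_VAR")).collect()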
Re: Alternative to Large Broadcast Variables
We are using Cassandra for a similar kind of problem and it works well... You need to take care of the race condition between updating the store and looking up the store...

On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote: +1 on Jason's suggestion. Regarding "this large variable is broadcast many times during the lifetime": please consider making this large variable more granular, meaning reduce the amount of data transferred between the key value store and your app during updates. Cheers

On Fri, Aug 28, 2015 at 12:44 PM, Jason ja...@jasonknight.us wrote: You could try using an external key value store (like HBase or Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead). I haven't done this myself though, so I'm just throwing the idea out there.

On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff j...@atware.co.jp wrote: Hi, I am working on a Spark application that is using a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner, so this large variable is broadcast many times during the lifetime of the application process. From what I have observed, perhaps 60% of the execution time is spent waiting for the variable to broadcast in each iteration. My reading of a Spark performance article [1] suggests that the time spent broadcasting will increase with the number of nodes I add. My question for the group: what would you suggest as an alternative to broadcasting a large variable like this? One approach I have considered is segmenting my RDD and adding a copy of the lookup table for each X number of values to process. So, for example, if I have a list of 1 million entries to process (e.g. RDD[Entry]), I could split this into segments of 100K entries, each with a copy of the lookup table, and make that an RDD[(Lookup, Array[Entry])]. Another solution I am looking at is making the lookup table an RDD instead of a broadcast variable. Perhaps I could use an IndexedRDD [2] to improve performance. One issue with this approach is that I would have to rewrite my application code to use two RDDs so that I do not reference the lookup RDD from within the closure of another RDD. Any other recommendations? Jeff [1] http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf [2] https://github.com/amplab/spark-indexedrdd
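A minimal sketch of Jason's per-partition connection pattern, with a hypothetical LookupClient standing in for whatever HBase/Redis/Cassandra client is actually used (none of these names come from the thread):

rdd.mapPartitions { entries =>
  val client = LookupClient.connect("kv-host:1234")   // one connection per partition, not per record
  val enriched = entries.map { e =>
    val lookupValue = client.get(e.key)                // remote lookup replaces the broadcast table
    (e, lookupValue)
  }.toList                                             // materialize before closing the connection
  client.close()
  enriched.iterator
}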
Re: Build k-NN graph for large dataset
Yes, you need to use dimensionality reduction and/or locality-sensitive hashing to reduce the number of pairs to compare. There is also an LSH implementation for collections of vectors that I have just published here: https://github.com/marufaytekin/lsh-spark. The implementation is based on this paper: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf I hope it will help. Maruf

On Thu, Aug 27, 2015 at 9:16 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Thank you all for these links. I'll check them.

On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack charles.t.h...@gmail.com wrote: +1 to all of the above, esp. dimensionality reduction and locality-sensitive hashing / min hashing. There's also an algorithm implemented in MLlib called DIMSUM which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about the results you get. https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum Charlie

On Wednesday, Aug 26, 2015 at 09:57, Michael Malak michaelma...@yahoo.com.invalid wrote: Yes. And a paper that describes using grids (actually varying grids) is http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter 7, which should become available in the MEAP mid-September. http://www.manning.com/books/spark-graphx-in-action

-- If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation: http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing

On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high dimensional vectors (say d 1) and I want to build a graph where those high dimensional points are the vertices and each one is linked to its k nearest neighbors based on some kind of similarity defined on the vertex space. My problem is to implement an efficient algorithm to compute the weight matrix of the graph. I need to compute N*N similarities and the only way I know is to use a cartesian operation followed by a map operation on the RDD. But this is very slow when N is large. Is there a more clever way to do this for an arbitrary similarity function? Cheers, Jao
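For reference, a minimal sketch of the DIMSUM route Charlie mentions (MLlib's columnSimilarities): note that it computes approximate cosine similarities between the columns of a RowMatrix, so the vectors to compare have to be laid out as columns, and the threshold below is an arbitrary example value.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0)
))
val mat = new RowMatrix(rows)
// DIMSUM: all-pairs column similarities, dropping pairs below the threshold
val sims = mat.columnSimilarities(0.1)
sims.entries.take(5).foreach(println)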
Re: Spark-on-YARN LOCAL_DIRS location
Yes, you can set SPARK_LOCAL_DIRS in spark-env.sh, or spark.local.dir in the spark-defaults.conf file; it would then use that location for the shuffle writes etc. Thanks, Best Regards

On Wed, Aug 26, 2015 at 6:56 PM, michael.engl...@nomura.com wrote: Hi, I am having issues with /tmp space filling up during Spark jobs because Spark-on-YARN uses yarn.nodemanager.local-dirs for shuffle space. I noticed this message appears when submitting Spark-on-YARN jobs: WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). I can't find much documentation on where to set the LOCAL_DIRS property. Please can someone advise whether this is a yarn-env.sh or a spark-env.sh property, and whether it would then use the directory specified by this env variable as a shuffle area instead of the default yarn.nodemanager.local-dirs location? Thanks, Mike
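A minimal sketch of the two standalone-mode options described above (the path is a placeholder). Note that, per the warning quoted in the question, on YARN the scratch directories still come from yarn.nodemanager.local-dirs rather than from these settings.

# conf/spark-env.sh on each worker
export SPARK_LOCAL_DIRS=/data/spark-scratch

# or conf/spark-defaults.conf
spark.local.dir /data/spark-scratch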
Re: Is there a way to store RDD and load it with its original format?
You can do an rdd.saveAsObjectFile(path) for storing, and for reading you can do an sc.objectFile(path). Thanks, Best Regards

On Thu, Aug 27, 2015 at 9:29 PM, saif.a.ell...@wellsfargo.com wrote: Hi, is there any way to store/load RDDs keeping their original objects instead of strings? I am having trouble with Parquet (there is always some error at class conversion), and I don't use Hadoop. Looking for alternatives. Thanks in advance, Saif
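A minimal sketch (the path and element type are illustrative); saveAsObjectFile uses Java serialization, so the element class must be serializable and available on the classpath of the job that reads it back:

case class Record(id: Long, score: Double)

val rdd = sc.parallelize(Seq(Record(1L, 0.5), Record(2L, 1.5)))
rdd.saveAsObjectFile("/tmp/records.obj")

// later, or in another job with the same classes on the classpath
val restored = sc.objectFile[Record]("/tmp/records.obj")
restored.take(2).foreach(println)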
Re: Spark MLlib multiclass classification
Thank you, I saw this before, but it is just binary classification, so how can I extend this to multiclass classification? Simply add different labels? E.g.: new LabeledDocument(0L, "a b c d e spark", 1.0), new LabeledDocument(1L, "b d", 0.0), new LabeledDocument(2L, "hadoop f g h", 2.0),

On Sun, Aug 30, 2015 at 7:32 AM, Feynman Liang fli...@databricks.com wrote: I would check out the Pipeline code example: https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net wrote: Hi! I want to implement multiclass classification for documents. I have different kinds of text files, and I want to classify them with Spark MLlib in Java. Do you have any code examples? Thanks! -- Egyed Zsombor, Junior Big Data Engineer
Re: How to avoid shuffle errors for a large join?
Can you try 1.5? This should work much, much better in 1.5 out of the box. For 1.4, I think you'd want to turn on sort-merge join, which is off by default. However, the sort-merge join in 1.4 can still trigger a lot of garbage, making it slower. SMJ performance is probably 5x - 1000x better in 1.5 for your case.

On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak tom...@gmail.com wrote: I'm getting errors like "Removing executor with no recent heartbeats" and "Missing an output location for shuffle" for a large Spark SQL join (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to configure the job to avoid them. The initial stage completes fine with some 30k tasks on a cluster with 70 machines/10TB memory, generating about 6.5TB of shuffle writes, but then the shuffle stage first waits 30min in the scheduling phase according to the UI, and then dies with the mentioned errors. I can see in the GC logs that the executors reach their memory limits (32g per executor, 2 workers per machine) and can't allocate any more stuff in the heap. Fwiw, the top 10 in the memory use histogram are:

num   #instances   #bytes        class name
1:    249139595    11958700560   scala.collection.immutable.HashMap$HashMap1
2:    251085327    8034730464    scala.Tuple2
3:    243694737    5848673688    java.lang.Float
4:    231198778    5548770672    java.lang.Integer
5:    72191585     4298521576    [Lscala.collection.immutable.HashMap;
6:    72191582     2310130624    scala.collection.immutable.HashMap$HashTrieMap
7:    74114058     1778737392    java.lang.Long
8:    6059103      779203840     [Ljava.lang.Object;
9:    5461096      174755072     scala.collection.mutable.ArrayBuffer
10:   34749        70122104      [B

Relevant settings are (Spark 1.4.1, Java 8 with G1 GC):
spark.core.connection.ack.wait.timeout 600
spark.executor.heartbeatInterval 60s
spark.executor.memory 32g
spark.mesos.coarse false
spark.network.timeout 600s
spark.shuffle.blockTransferService netty
spark.shuffle.consolidateFiles true
spark.shuffle.file.buffer 1m
spark.shuffle.io.maxRetries 6
spark.shuffle.manager sort

The join is currently configured with spark.sql.shuffle.partitions=1000 but that doesn't seem to help. Would increasing the partitions help? Is there a formula to determine an approximate partition count for a join? Any help with this job would be appreciated! cheers, Tom
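For the 1.4 suggestion above, a sketch of the relevant switch; the flag name is from my reading of the 1.4 release notes, so verify it against your exact version:

# spark-defaults.conf, or passed via --conf on spark-submit
spark.sql.planner.sortMergeJoin  true
# keep (or raise) the shuffle parallelism for a join of this size
spark.sql.shuffle.partitions     1000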
Spark shell and StackOverflowError
I am running the Spark shell (1.2.1) in local mode and I have a simple RDD[(String,String,Double)] with about 10,000 objects in it. I get a StackOverflowError each time I try to run the following code (the code itself is just representative of other logic where I need to pass in a variable). I tried broadcasting the variable too, but no luck... I am missing something basic here -

val rdd = sc.makeRDD(List(...))   // data read from file
val a = 10
rdd.map(r => if (a == 10) 1 else 0)

This throws -

java.lang.StackOverflowError
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:318)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1133)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
...

More experiments... this works -

val lst = Range(0, 1).map(i => ("10", "10", i: Double)).toList
sc.makeRDD(lst).map(i => if (a == 10) 1 else 0)

But the below doesn't, and throws the StackOverflowError -

val lst = MutableList[(String,String,Double)]()
Range(0, 1).foreach(i => lst += (("10", "10", i: Double)))
sc.makeRDD(lst).map(i => if (a == 10) 1 else 0)

Any help appreciated! Thanks, Ashish
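Not from the thread, but a plausible explanation worth checking: scala.collection.mutable.MutableList is a linked list, and Java serialization of a long linked structure recurses once per element when the parallelized data is serialized into tasks, which can blow the stack; the Range-built List case works because it is converted differently. Converting to an indexed collection before handing it to Spark is a cheap way to test that theory. A sketch (sizes are illustrative):

import scala.collection.mutable.MutableList

val lst = MutableList[(String, String, Double)]()
Range(0, 10000).foreach(i => lst += (("10", "10", i: Double)))

val a = 10
// Vector serializes iteratively, so the deep recursion goes away
val rdd = sc.makeRDD(lst.toVector)
rdd.map(r => if (a == 10) 1 else 0).count()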
Re: Setting number of CORES from inside the Topology (Java code)
When you set .setMaster to local[4], it means that you are allocating 4 threads on your local machine. You can change it to local[1] to run it on a single thread. If you are submitting the job to a standalone Spark cluster and you want to limit the number of cores for your job, then you can do it like sparkConf.set("spark.cores.max", "224"). Thanks, Best Regards

On Wed, Aug 26, 2015 at 7:26 PM, anshu shukla anshushuk...@gmail.com wrote: Hey, I need to set the number of cores from inside the topology. It's working fine by setting it in spark-env.sh, but I am unable to do it via setting a key/value on the conf.

SparkConf sparkConf = new SparkConf().setAppName("JavaCustomReceiver").setMaster("local[4]");
if (toponame.equals("IdentityTopology")) {
    sparkConf.setExecutorEnv("SPARK_WORKER_CORES", "1");
}

-- Thanks & Regards, Anshu Shukla
Spark MLlib multiclass classification
Hi! I want to implement multiclass classification for documents. I have different kinds of text files, and I want to classify them with Spark MLlib in Java. Do you have any code examples? Thanks! -- Egyed Zsombor, Junior Big Data Engineer
Re: How to remove worker node but let it finish first?
It's only available in Mesos? I'm using a Spark standalone cluster; is there anything like it there?

On Fri, Aug 28, 2015 at 8:51 AM Akhil Das ak...@sigmoidanalytics.com wrote: You can create a custom Mesos framework for your requirement; to get you started you can check this out: http://mesos.apache.org/documentation/latest/app-framework-development-guide/ Thanks, Best Regards

On Mon, Aug 24, 2015 at 12:11 PM, Romi Kuntsman r...@totango.com wrote: Hi, I have a Spark standalone cluster with 100s of applications per day, and it changes size (more or fewer workers) at various hours. The driver runs on a separate machine outside the Spark cluster. When a job is running and its worker is killed (because at that hour the number of workers is reduced), it sometimes fails instead of redistributing the work to other workers. How is it possible to decommission a worker, so that it doesn't receive any new work but does finish all existing work before shutting down? Thanks!
Re: Alternative to Large Broadcast Variables
Thanks for the recommendations. I had been focused on solving the problem within Spark, but a distributed database sounds like a better solution. Jeff

On Sat, Aug 29, 2015 at 11:47 PM, Ted Yu yuzhih...@gmail.com wrote: Not sure if the race condition you mentioned is related to Cassandra's data consistency model. If HBase is used as the external key value store, atomicity is guaranteed. Cheers

On Sat, Aug 29, 2015 at 7:40 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: We are using Cassandra for a similar kind of problem and it works well... You need to take care of the race condition between updating the store and looking up the store...

On Aug 29, 2015 1:31 AM, Ted Yu yuzhih...@gmail.com wrote: +1 on Jason's suggestion. Regarding "this large variable is broadcast many times during the lifetime": please consider making this large variable more granular, meaning reduce the amount of data transferred between the key value store and your app during updates. Cheers

On Fri, Aug 28, 2015 at 12:44 PM, Jason ja...@jasonknight.us wrote: You could try using an external key value store (like HBase or Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead). I haven't done this myself though, so I'm just throwing the idea out there.

On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff j...@atware.co.jp wrote: Hi, I am working on a Spark application that is using a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner, so this large variable is broadcast many times during the lifetime of the application process. From what I have observed, perhaps 60% of the execution time is spent waiting for the variable to broadcast in each iteration. My reading of a Spark performance article [1] suggests that the time spent broadcasting will increase with the number of nodes I add. My question for the group: what would you suggest as an alternative to broadcasting a large variable like this? One approach I have considered is segmenting my RDD and adding a copy of the lookup table for each X number of values to process. So, for example, if I have a list of 1 million entries to process (e.g. RDD[Entry]), I could split this into segments of 100K entries, each with a copy of the lookup table, and make that an RDD[(Lookup, Array[Entry])]. Another solution I am looking at is making the lookup table an RDD instead of a broadcast variable. Perhaps I could use an IndexedRDD [2] to improve performance. One issue with this approach is that I would have to rewrite my application code to use two RDDs so that I do not reference the lookup RDD from within the closure of another RDD. Any other recommendations? Jeff [1] http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf [2] https://github.com/amplab/spark-indexedrdd
Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs
I am doing some memory tuning on my Spark job on YARN and I notice that different settings give different results and affect the outcome of the Spark job run. However, I am confused and do not completely understand why this happens, and I would appreciate it if someone could provide me with some guidance and explanation. I will provide some background information, describe the cases that I have experienced, and post my questions after them below.

My environment settings were as below:
- Memory 20G, 20 VCores per node (3 nodes in total)
- Hadoop 2.6.0
- Spark 1.4.0

My code recursively filters an RDD to make it smaller (removing examples as part of an algorithm), then does mapToPair and collect to gather the results and save them within a list.

First case:
/bin/spark-submit --class <class name> --master yarn-cluster --driver-memory 7g --executor-memory 1g --num-executors 3 --executor-cores 1 --jars <jar file>
If I run my program with any driver memory less than 11g, I get the error below, which is the SparkContext being stopped, or a similar error which is a method being called on a stopped SparkContext. From what I have gathered, this is related to memory not being enough. http://apache-spark-user-list.1001560.n3.nabble.com/file/n24507/EKxQD.png

Second case:
/bin/spark-submit --class <class name> --master yarn-cluster --driver-memory 7g --executor-memory 3g --num-executors 3 --executor-cores 1 --jars <jar file>
If I run the program with the same driver memory but higher executor memory, the job runs longer (about 3-4 minutes) than the first case, and then it encounters a different error from earlier: a container requesting/using more memory than allowed and being killed because of that. I find it weird, since the executor memory was increased and yet this error occurs instead of the error in the first case. http://apache-spark-user-list.1001560.n3.nabble.com/file/n24507/tr24f.png

Third case:
/bin/spark-submit --class <class name> --master yarn-cluster --driver-memory 11g --executor-memory 1g --num-executors 3 --executor-cores 1 --jars <jar file>
Any setting with driver memory greater than 10g leads to the job running successfully.

Fourth case:
/bin/spark-submit --class <class name> --master yarn-cluster --driver-memory 2g --executor-memory 1g --conf spark.yarn.executor.memoryOverhead=1024 --conf spark.yarn.driver.memoryOverhead=1024 --num-executors 3 --executor-cores 1 --jars <jar file>
The job runs successfully with this setting (driver memory 2g and executor memory 1g) when the driver memory overhead (1g) and the executor memory overhead (1g) are increased.

Questions:
1. Why is a different error thrown, and why does the job run longer, in the second case, when only the executor memory is increased between the first and second case? Are the two errors linked in some way?
2. Both the third and fourth cases succeed, and I understand that it is because I am giving more memory, which solves the memory problems. However, in the third case,
spark.driver.memory + spark.yarn.driver.memoryOverhead = the memory YARN will create a JVM with = 11g + (driverMemory * 0.07, with minimum of 384m) = 11g + 1.154g = 12.154g
So, from the formula, I can see that my job requires a MEMORY_TOTAL of around 12.154g to run successfully, which explains why I need more than 10g for the driver memory setting.
But for the fourth case,
spark.driver.memory + spark.yarn.driver.memoryOverhead = the memory YARN will create a JVM with = 2g + (driverMemory * 0.07, with minimum of 384m) = 2g + 0.524g = 2.524g
It seems that just by increasing the memory overhead by a small amount of 1024 (1g), the job runs successfully with a driver memory of only 2g, and MEMORY_TOTAL is only 2.524g! Whereas without the overhead configuration, driver memory less than 11g fails, which doesn't make sense from the formula, and that is why I am confused. Why does increasing the memory overhead (for both driver and executor) allow my job to complete successfully with a lower MEMORY_TOTAL (12.154g vs 2.524g)? Is there something else at work internally here that I am missing? I would really appreciate any help offered, as it would really help with my understanding of Spark. Thanks in advance.
Re: spark-submit issue
Did you try putting an sc.stop() at the end of your pipeline? Thanks, Best Regards

On Thu, Aug 27, 2015 at 6:41 PM, pranay pranay.ton...@impetus.co.in wrote: I have a Java program that does this (using Spark 1.3.1): it creates a command string that uses spark-submit (with my class file etc.) and stores this string in a temp file somewhere as a shell script. Using Runtime.exec, I execute this script and wait for its completion using process.waitFor. Doing ps -ef shows me SparkSubmitDriverBootstrapper and the script running my class, in a parent/child relationship. The job gets triggered on the Spark cluster and finishes, but SparkSubmitDriverBootstrapper still shows up; because of this, process.waitFor never returns and I can't detect the end of execution. If I run the /temp file independently, things work fine... only when I trigger the /temp script inside Runtime.exec does this issue occur. Any comments?
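A minimal sketch of what the suggestion amounts to: if the driver program never stops its SparkContext, the spark-submit JVM can stay alive after the job's work is done, and the launching process's waitFor keeps waiting. The class name and pipeline below are placeholders, not taken from the thread.

import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyJob"))
    try {
      sc.textFile(args(0)).count()   // ... the actual pipeline ...
    } finally {
      sc.stop()                      // lets the driver JVM (and spark-submit) exit cleanly
    }
  }
}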
Re: commit DB Transaction for each partition
What problem are you having? You will have to trigger an action at the end to execute this piece of code, like:

rdd.mapPartitions(partitionOfRecords => {
  DBConnectionInit()
  val results = partitionOfRecords.map(..)
  DBConnection.commit()
  results
}).count()

Thanks, Best Regards

On Thu, Aug 27, 2015 at 7:32 PM, Ahmed Nawar ahmed.na...@gmail.com wrote: Dears, I need to commit the DB transaction for each partition, not for each row. The below didn't work for me.

rdd.mapPartitions(partitionOfRecords => {
  DBConnectionInit()
  val results = partitionOfRecords.map(..)
  DBConnection.commit()
  results
})

Best regards, Ahmed Atef Nawwar, Data Management & Big Data Consultant
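One more thing worth noting, as a sketch (DBConnectionInit, DBConnection and process are placeholders from the thread, not a real API): partitionOfRecords.map is a lazy iterator transformation, so in the snippet above the commit runs before any record has actually been processed; materializing the partition's results before committing avoids that.

rdd.mapPartitions { partitionOfRecords =>
  DBConnectionInit()
  val results = partitionOfRecords.map(process).toList  // force the per-record work before committing
  DBConnection.commit()
  results.iterator
}.count()                                               // an action is still needed to run the job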
Re: How to generate spark assembly (jar file) using Intellij
Have you tried `build/sbt assembly`?

On Sat, Aug 29, 2015 at 9:03 PM, Muler mulugeta.abe...@gmail.com wrote: Hi guys, I can successfully build Spark using IntelliJ, but I'm not able to locate/generate the Spark assembly (jar file) in the assembly/target directory. How do I generate one? I have attached a screenshot of my IDE. Thanks,
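For reference, a sketch of the usual command-line route; the profiles and output path reflect my memory of the Spark 1.x layout, so verify them against your checkout:

# from the Spark source root
build/sbt -Pyarn -Phadoop-2.6 assembly
# the assembly jar then lands under something like:
#   assembly/target/scala-2.10/spark-assembly-<version>-hadoop<version>.jar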
Re: Spark MLlib multiclass classification
I would check out the Pipeline code example: https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net wrote: Hi! I want to implement multiclass classification for documents. I have different kinds of text files, and I want to classify them with Spark MLlib in Java. Do you have any code examples? Thanks! -- Egyed Zsombor, Junior Big Data Engineer
Re: Spark MLlib multiclass classification
I think the spark.ml logistic regression currently only supports 0/1 labels. If you need multiclass, I would suggest looking at the spark.ml decision trees. If you don't care too much about pipelines, then you could use the spark.mllib logistic regression after featurizing.

On Sat, Aug 29, 2015 at 10:49 PM, Zsombor Egyed egye...@starschema.net wrote: Thank you, I saw this before, but it is just binary classification, so how can I extend this to multiclass classification? Simply add different labels? E.g.: new LabeledDocument(0L, "a b c d e spark", 1.0), new LabeledDocument(1L, "b d", 0.0), new LabeledDocument(2L, "hadoop f g h", 2.0),

On Sun, Aug 30, 2015 at 7:32 AM, Feynman Liang fli...@databricks.com wrote: I would check out the Pipeline code example: https://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

On Sat, Aug 29, 2015 at 9:23 PM, Zsombor Egyed egye...@starschema.net wrote: Hi! I want to implement multiclass classification for documents. I have different kinds of text files, and I want to classify them with Spark MLlib in Java. Do you have any code examples? Thanks! -- Egyed Zsombor, Junior Big Data Engineer
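A minimal sketch of the spark.mllib route mentioned above (Scala rather than Java, and the labels/features are toy placeholders): LogisticRegressionWithLBFGS takes a number of classes, so labels can be 0.0, 1.0, 2.0, ... once the documents have been featurized, e.g. with TF-IDF.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// toy featurized documents: one label per class and a feature vector each
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)),
  LabeledPoint(2.0, Vectors.dense(0.0, 0.0, 1.0))
))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(3)            // multinomial logistic regression
  .run(training)

model.predict(Vectors.dense(0.0, 1.0, 0.0))   // expected: 1.0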
Re: Invalid environment variable name when submitting job from Windows
I think you have to use the set keyword to set an environment variable on Windows. Check the section "Setting environment variables" at http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/ntcmds_shelloverview.mspx?mfr=true Thanks, Best Regards

On Tue, Aug 25, 2015 at 1:25 PM, Yann ROBIN me.s...@gmail.com wrote: Hi, We have a Spark standalone cluster running on Linux. We have a job that we submit to the Spark cluster from Windows. When submitting this job using Windows, the execution fails with this error in the notes: java.lang.IllegalArgumentException: Invalid environment variable name: =::. When submitting from Linux it works fine. I thought that this might be the result of one of the env variables on my system, so I've modified the submit cmd to remove all env variables except the ones needed by Java. This is the env before executing the java command:

ASSEMBLY_DIR=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib
ASSEMBLY_DIR1=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\../assembly/target/scala-2.10
ASSEMBLY_DIR2=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\../assembly/target/scala-2.11
CLASS=org.apache.spark.deploy.SparkSubmit
CLASSPATH=.;
JAVA_HOME=C:\Program Files\Java\jre1.8.0_51
LAUNCHER_OUTPUT=\spark-class-launcher-output-23386.txt
LAUNCH_CLASSPATH=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar
PYTHONHASHSEED=0
RUNNER=C:\Program Files\Java\jre1.8.0_51\bin\java
SPARK_ASSEMBLY_JAR=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar
SPARK_CMD=C:\Program Files\Java\jre1.8.0_51\bin\java -cp c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\conf\;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\datanucleus-api-jdo-3.2.6.jar;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\datanucleus-core-3.2.10.jar;c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\datanucleus-rdbms-3.2.9.jar org.apache.spark.deploy.SparkSubmit --master spark://172.16.8.21:7077 --deploy-mode cluster --conf spark.driver.memory=4G --conf spark.driver.extraClassPath=/opt/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar --class com.publica.Accounts --verbose http://server/data-analytics/data-analytics.jar spark://172.16.8.21:7077 data-analysis http://server/data-analytics/data-analytics.jar 23 8 2015
SPARK_ENV_LOADED=1
SPARK_HOME=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..
SPARK_SCALA_VERSION=2.10
SystemRoot=C:\Windows
user_conf_dir=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\..\conf
_SPARK_ASSEMBLY=c:\spark\spark-1.4.0-bin-hadoop2.6\bin\..\lib\spark-assembly-1.4.0-hadoop2.6.0.jar

Is there a way to make this work? -- Yann
Re: Scala: Overload method by its class type
This is more of a Scala-related question; have a look at case classes in Scala: http://www.scala-lang.org/old/node/107 Thanks, Best Regards

On Tue, Aug 25, 2015 at 6:55 PM, saif.a.ell...@wellsfargo.com wrote: Hi all, I have

class SomeClass[TYPE] {
  def some_method(args: fixed_type_args): TYPE
}

and at runtime I create instances of this class with different AnyVal + String types, but the return type of some_method varies. I know I could do this with an implicit object IF some_method received a type, but in this case I need to have TYPE defined on the class instance, so for example:

val int_instance = new SomeClass[Int]
val str_instance = new SomeClass[String]
val result: Boolean = int_instance.some_method(args) > 0                 // I expected Int here
val result2: Boolean = str_instance.some_method(args) contains "asdfg"   // I expected String here

without compilation errors. Any ideas? I would like to implement something like this:

class SomeClass[TYPE] {
  def some_method(args: Int): Int = { process_integer_overloaded_method }
  def some_method(args: Int): String = { process_string_overloaded_method }
}

and so on. Any ideas? Maybe store the class's TYPE in a constructor instead as a variable somehow? Thanks, Saif
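Picking up the last idea in the question (storing the per-type behaviour on the instance), a minimal sketch with illustrative names only: pass the TYPE-specific conversion in as a constructor argument, so no overloading is needed and each instance returns its own TYPE.

class SomeClass[TYPE](convert: String => TYPE) {
  def some_method(args: String): TYPE = convert(args)
}

val int_instance = new SomeClass[Int](_.trim.toInt)
val str_instance = new SomeClass[String](identity)

val result: Boolean  = int_instance.some_method(" 42 ") > 0
val result2: Boolean = str_instance.some_method("asdfg qwerty") contains "asdfg"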