SparkSQL DataFrame: partitionColumn, lowerBound, upperBound, numPartitions in the context of reading from MySQL
I'm trying to understand what the following configurations mean and their implications for reading data from a MySQL table. I'm looking for options that will impact my read throughput when reading from a large table. Thanks. partitionColumn, lowerBound, upperBound, numPartitions: these options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned.
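For illustration, a minimal sketch of a partitioned JDBC read using these options (assuming the Spark 1.4+ DataFrameReader API; the connection URL, table name, and column names are hypothetical):

```scala
// Sketch of a partitioned JDBC read; dbhost, mydb, employees and id are
// hypothetical. With numPartitions = 4 over [0, 100000), Spark issues four
// parallel queries, each covering one stride of the id range.
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")
  .option("dbtable", "employees")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")   // must be numeric
  .option("lowerBound", "0")         // only sets the stride,
  .option("upperBound", "100000")    // does not filter rows
  .option("numPartitions", "4")
  .load()
```

With these settings Spark opens numPartitions parallel connections, one per stride; rows with ids outside [lowerBound, upperBound) are not dropped, they simply end up in the first or last partition.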
Re: Yarn client mode: Setting environment variables
Can you give some examples of what variables you are trying to set? On Thu, Feb 18, 2016 at 1:01 AM, Lin Zhao wrote: > I've been trying to set some environment variables for the spark executors > but haven't had much luck. I tried editing conf/spark-env.sh but it > doesn't get through to the executors. I'm running 1.6.0 and yarn, any > pointer is appreciated. > > Thanks, > Lin >
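For reference, a minimal sketch of one way to pass an environment variable to executors programmatically (the variable name and value are hypothetical); on YARN the equivalent `--conf spark.executorEnv.<name>=<value>` on spark-submit should behave the same way:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// setExecutorEnv sets spark.executorEnv.MY_VAR under the hood, so MY_VAR is
// exported in each executor's environment. MY_VAR is a hypothetical name.
val conf = new SparkConf()
  .setAppName("env-example")
  .setExecutorEnv("MY_VAR", "some-value")

val sc = new SparkContext(conf)
```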
Re: Building Spark behind a proxy
I can do a wget http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and get the file successfully in a shell. On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas vcsub...@gmail.com wrote: At least part of it is due to a connection refusal; can you check whether curl can reach the URL with the proxies? - [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central (http://repo.maven.apache.org/maven2): Error transferring file: Connection refused from http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta soumya.sima...@gmail.com wrote: On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Does the error change when you build with and without the build options? What do you mean by build options? I'm just doing ./sbt/sbt assembly from $SPARK_HOME. Did you try using Maven and doing the proxy settings there? No, I've not tried Maven yet. However, I did set proxy settings inside my .m2/settings.xml, but it didn't make any difference.
Building Spark behind a proxy
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using ./sbt/sbt assembly and I get the following error. I've set the http and https proxies as well as JAVA_OPTS. Any idea what I am missing?

```
[warn] one warning found
org.apache.maven.model.building.ModelBuildingException: 1 problem was encountered while building the effective model for org.apache.spark:spark-parent:1.1.1
[FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central (http://repo.maven.apache.org/maven2): Error transferring file: Connection refused from http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 'parent.relativePath' points at wrong local POM @ line 21, column 11
    at org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
    at org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
    at org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
    at org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
    at org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
    at com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
    at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
    at com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
    at com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
    at com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
    at SparkBuild$.projectDefinitions(SparkBuild.scala:165)
    at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:458)
    at sbt.Load$$anonfun$24.apply(Load.scala:415)
    at sbt.Load$$anonfun$24.apply(Load.scala:415)
    at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
    at sbt.Load$.loadUnit(Load.scala:415)
    at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)
    at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)
    at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:93)
    at sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:92)
    at sbt.BuildLoader.apply(BuildLoader.scala:143)
    at sbt.Load$.loadAll(Load.scala:312)
    at sbt.Load$.loadURI(Load.scala:264)
    at sbt.Load$.load(Load.scala:260)
    at sbt.Load$.load(Load.scala:251)
    at sbt.Load$.apply(Load.scala:134)
    at sbt.Load$.defaultLoad(Load.scala:37)
    at sbt.BuiltinCommands$.doLoadProject(Main.scala:473)
    at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:467)
    at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:467)
    at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
    at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
    at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
    at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
    at sbt.Command$.process(Command.scala:95)
    at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:100)
    at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:100)
    at sbt.State$$anon$1.process(State.scala:179)
    at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:100)
    at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:100)
    at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
    at sbt.MainLoop$.next(MainLoop.scala:100)
    at sbt.MainLoop$.run(MainLoop.scala:93)
    at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:71)
    at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:66)
    at sbt.Using.apply(Using.scala:25)
    at sbt.MainLoop$.runWithNewLog(MainLoop.scala:66)
    at sbt.MainLoop$.runAndClearLast(MainLoop.scala:49)
    at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:33)
    at sbt.MainLoop$.runLogged(MainLoop.scala:25)
    at sbt.StandardMain$.runManaged(Main.scala:57)
    at sbt.xMain.run(Main.scala:29)
    at xsbt.boot.Launch$$anonfun$run$1.apply(Launch.scala:109)
    at xsbt.boot.Launch$.withContextLoader(Launch.scala:129)
    at xsbt.boot.Launch$.run(Launch.scala:109)
    at xsbt.boot.Launch$$anonfun$apply$1.apply(Launch.scala:36)
    at xsbt.boot.Launch$.launch(Launch.scala:117)
    at xsbt.boot.Launch$.apply(Launch.scala:19)
    at xsbt.boot.Boot$.runImpl(Boot.scala:44)
    at xsbt.boot.Boot$.main(Boot.scala:20)
    at xsbt.boot.Boot.main(Boot.scala)
[error]
```
Spark UI and Spark Version on Google Compute Engine
I'm deploying Spark using the "Click to Deploy Hadoop - Install Apache Spark" option on Google Compute Engine. I can run Spark jobs in the REPL and read data from Google Storage. However, I'm not sure how to access the Spark UI in this deployment. Can anyone help? Also, it deploys Spark 1.1. Is there an easy way to bump it to Spark 1.2? Thanks -Soumya
Re: Sharing sqlContext between Akka router and routee actors ...
Why do you need a router? I mean, can't you just use one actor which has the SQLContext inside it? On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, An Akka router creates a sqlContext and creates a bunch of routee actors with the sqlContext as a parameter. The actors then execute queries on that sqlContext. Would this pattern be an issue? Is there any other way a sparkContext etc. should be shared cleanly among Akka routers/routees? Thanks,
Trying to understand a basic difference between these two configurations
I'm trying to understand the conceptual difference between these two configurations in terms of performance (using a Spark standalone cluster).

Case 1: 1 node, 60 cores, 240G of memory, 50G of data on the local file system.
Case 2: 6 nodes, 10 cores per node, 40G of memory per node, 50G of data on HDFS; nodes are connected by a 10G network.

I just wanted to validate my understanding.
1. Reads in case 1 will be slower than in case 2 because in case 2 all 6 nodes can read the data in parallel from HDFS. However, if I change the file system to HDFS in case 1, my read speeds will be conceptually the same as in case 2. Correct?
2. Once the data is loaded, case 1 will execute operations faster because there is no network overhead and all shuffle operations are local.
3. Obviously, case 1 is bad from a fault-tolerance point of view because there is a single point of failure.
Thanks -Soumya
Re: Spark or MR, Scala or Java?
Thanks Sean. Adding user@spark.apache.org again. On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote: On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta soumya.sima...@gmail.com wrote: Is the MapReduce API simpler, or the implementation? Almost every Spark presentation has a slide that shows 100+ lines of Hadoop MR code in Java and the same feature implemented in 3 lines of Scala code on Spark. So the Spark API is certainly simpler, at least based on what I know. What am I missing here? The implementation is simpler. The API is not. However, I don't think anyone 'really' uses the M/R API directly now. They use Crunch or maybe Cascading. These are also much less than 100 lines for word count, on top of M/R. Can you please expand on what you mean by efficient? Better performance and/or reliability, fewer resources, or something else? All of the above. Map/Reduce is simple and easy to understand, and Spark is actually hard to reason about, and heavy-weight. Of course, as soon as your work spans more than one MapReduce, this reasoning changes a lot. But MapReduce is better for truly map-only, or map-with-a-reduce-only, workloads. It is optimized for this case. The shuffle is still better.
Re: MongoDB Bulk Inserts
Does bulkLoad hold the connection to MongoDB? On Fri, Nov 21, 2014 at 4:34 PM, Benny Thompson ben.d.tho...@gmail.com wrote: I tried using RDD#mapPartitions but my job completes prematurely and without error, as if nothing gets done. What I have is fairly simple:

```scala
sc.textFile(inputFile)
  .map(parser.parse)
  .mapPartitions(bulkLoad)
```

But the Iterator[T] of mapPartitions is always empty, even though I know map is generating records. On Thu, Nov 20, 2014 at 9:25:54 PM, Soumya Simanta soumya.sima...@gmail.com wrote: On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com wrote: I'm trying to use MongoDB as a destination for an ETL I'm writing in Spark. It appears I'm gaining a lot of overhead in my system databases (and possibly in the primary documents themselves); I can only assume it's because I'm left to using PairRDD.saveAsNewAPIHadoopFile. - Is there a way to batch some of the data together and use Casbah natively so I can use bulk inserts? Why can't you write to Mongo in an RDD#mapPartitions? - Is there maybe a less hacky way to load into MongoDB (instead of using saveAsNewAPIHadoopFile)? If the latency (the time by which all data should be in Mongo) is not a concern, you can try a separate process that uses Akka/Casbah to write from HDFS into Mongo.
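For illustration, a minimal sketch of the partition-wise bulk-write pattern suggested above, using foreachPartition (an action, so nothing depends on the returned iterator, unlike mapPartitions); `parser.parse`, `createMongoClient` and `bulkInsert` are hypothetical stand-ins for the poster's parser and a Casbah/Mongo bulk-write helper:

```scala
// Sketch only: createMongoClient and bulkInsert are hypothetical helpers.
sc.textFile(inputFile)
  .map(parser.parse)
  .foreachPartition { records =>
    // One connection per partition; batch the records and write them in bulk.
    val client = createMongoClient()
    try records.grouped(1000).foreach(batch => bulkInsert(client, batch))
    finally client.close()
  }
```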
Parsing a large XML file using Spark
If there is one big XML file (e.g., the Wikipedia dump, 44GB, or the larger dump that has all revision information as well) stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a PullParser or Iteratee? My current solution is to read the single XML file in a first pass, split it into smaller files written to HDFS, and then read the small files in parallel on the Spark workers. Thanks -Soumya
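For what it's worth, a minimal sketch of one common approach, assuming a Hadoop XmlInputFormat implementation (such as the one shipped with Apache Mahout) is on the classpath; the class, config keys, path, and tag names here are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

// Assumes an XmlInputFormat (e.g. from Mahout) that splits an uncompressed
// file at <page>...</page> boundaries so partitions can be parsed in parallel.
val conf = new Configuration()
conf.set("xmlinput.start", "<page>")
conf.set("xmlinput.end", "</page>")

val pages = sc.newAPIHadoopFile(
  "hdfs:///data/wikipedia.xml",   // hypothetical path
  classOf[XmlInputFormat],        // hypothetical import, e.g. from Mahout
  classOf[LongWritable],
  classOf[Text],
  conf
).map { case (_, text) => text.toString } // each record is one <page> element
```

This avoids the extra split-and-rewrite pass, though it only works if the file is stored uncompressed (or in a splittable codec) on HDFS.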
SparkSQL performance
I was really surprised to see the results here, esp. SparkSQL not completing: http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. That essentially means that in most cases SparkSQL should be about as fast as core Spark. I would be very interested to hear what others in the group have to say about this. Thanks -Soumya
Re: SparkSQL performance
I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (e.g., the ordering of Spark operations) and uses a more efficient storage format. Is this assumption correct? Has anyone done any comparison of SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe. A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting-record results. Thanks -Soumya On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote: We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are. Ideally it's critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-) Du From: Soumya Simanta soumya.sima...@gmail.com Date: Friday, October 31, 2014 at 4:04 PM To: user@spark.apache.org Subject: SparkSQL performance
Re: sbt/sbt compile error [FATAL]
Are you trying to compile the master branch? Can you try branch-1.1? On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS hanspeter.sl...@gmail.com wrote: Hi, I have cloned Spark as:

```
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
```

Apparently http://repo.maven.apache.org/maven2 is no longer valid. See the error further below. Is this correct? And what should it be changed to? Everything seems to go smoothly until:

```
[info] downloading https://repo1.maven.org/maven2/org/ow2/asm/asm-tree/5.0.3/asm-tree-5.0.3.jar ...
[info] [SUCCESSFUL ] org.ow2.asm#asm-tree;5.0.3!asm-tree.jar (709ms)
[info] Done updating.
[info] Compiling 1 Scala source to /root/spark/project/spark-style/target/scala-2.10/classes...
[info] Compiling 9 Scala sources to /root/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/target/scala-2.10/sbt-0.13/classes...
[warn] there were 1 deprecation warning(s); re-run with -deprecation for details
[warn] one warning found
[info] Compiling 3 Scala sources to /root/spark/project/target/scala-2.10/sbt-0.13/classes...
org.apache.maven.model.building.ModelBuildingException: 1 problem was encountered while building the effective model for org.apache.spark:spark-parent:1.2.0-SNAPSHOT
[FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central (http://repo.maven.apache.org/maven2): Error transferring file: repo.maven.apache.org from http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 'parent.relativePath' points at wrong local POM @ line 22, column 11
```

Regards, Hans
Re: install sbt
sbt is just a jar file, so you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents:

```
SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=1024M"
echo $SBT_OPTS
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar $@
```

On Tue, Oct 28, 2014 at 12:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If you're just calling sbt from within the spark/sbt folder, it should download and install automatically. Nick On Tuesday, October 28, 2014, Ted Yu yuzhih...@gmail.com wrote: Have you read this? http://lancegatlin.org/tech/centos-6-install-sbt On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: Is there a repo or some kind of instruction about how to install sbt for CentOS? Thanks,
Re: run multiple spark applications in parallel
Try reducing the resources (cores and memory) of each application. On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote: Hi, How do I run multiple Spark applications in parallel? I tried to run on a YARN cluster, but the second application submitted does not run. Thanks, Josh
Re: run multiple spark applications in parallel
Maybe changing --master yarn-cluster to --master yarn-client will help. On Tue, Oct 28, 2014 at 7:25 PM, Josh J joshjd...@gmail.com wrote: Sorry, I should've included some stats with my email. I execute each job in the following manner: ./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 UBER.JAR ${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1 The box has 24 CPUs (Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz) and 32 GB RAM. Thanks, Josh On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Try reducing the resources (cores and memory) of each application. On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote: Hi, How do I run multiple Spark applications in parallel? I tried to run on a YARN cluster, but the second application submitted does not run. Thanks, Josh
Re: scalac crash when compiling DataTypeConversions.scala
You need to change the Scala compiler in IntelliJ to the "sbt incremental compiler" (see the screenshot below). You can access this by going to "preferences" > "scala". NOTE: this is supported only for certain versions of the IntelliJ Scala plugin. See this link for details: http://blog.jetbrains.com/scala/2014/01/30/try-faster-scala-compiler-in-intellij-idea-13-0-2/ On Mon, Oct 27, 2014 at 9:04 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: guoxu1231, I struggled with the IDEA problem for a full week. Same thing -- clean builds under mvn/sbt, but no luck with IDEA. What worked for me was the solution posted higher up in this thread -- it's an SO post that basically says to delete all .iml files anywhere under the project directory. Let me know if you can't see this mail and I'll locate the exact SO post. On Mon, Oct 27, 2014 at 5:15 AM, guoxu1231 guoxu1...@gmail.com wrote: Hi Stephen, I tried it again. To avoid the profile impact, I executed mvn -DskipTests clean package with Hadoop 1.0.4 by default, then opened IDEA and imported it as a Maven project, and I didn't choose any profile in the import wizard. Then I ran Make project / Rebuild project in IDEA; unfortunately the DataTypeConversions.scala compile failed again. Is there any updated guide for using it with IntelliJ IDEA? I'm following "Building Spark with Maven" on the website.
Re: Spark as Relational Database
@Peter - as Rick said, Spark's main use is data analysis, not storage. Spark allows you to plug in different storage layers based on your use cases and quality-attribute requirements. So in essence, if your relational database is meeting your storage requirements, you should think about how to use it with Spark, because even if you decide not to use your relational database you will have to select some storage layer, most likely a distributed one. Another option to think about is: can you possibly restructure your data schema so that you don't have to do such a large number of joins? If so, you can potentially consider stores such as Cassandra, HBase, HDFS, etc. Spark really excels at processing large volumes of data really fast (given enough memory) on horizontally scalable commodity hardware. As Rick pointed out, it will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. However, if your data size is much larger than your cluster memory, you have no choice but to select a datastore. HTH -Soumya On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com wrote: Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk. What are your requirements exactly? What is "massive amounts of data" exactly? How big is your cluster? Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDDs. As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose-built, huge-storage database such as Cassandra. If your data is very static, you could also just store it in files. On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote: My understanding is that SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLlib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data. Is Spark good at this? Thank you very much in advance, Peter
Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode
I'm working on a cluster where I need to start the workers separately and connect them to a master. I'm following the instructions here and using branch-1.1: http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually I can start the master using ./sbin/start-master.sh. When I try to start the slave/worker using ./sbin/start-slave.sh it doesn't work; the logs say that it needs the master. When I provide ./sbin/start-slave.sh spark://master-ip:7077 it still doesn't work. I can start the worker using the following command (as described in the documentation): ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT I was wondering why start-slave.sh is not working. Thanks -Soumya
Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings
I have a SchemaRDD that I want to convert to an RDD of Strings. How do I convert the Rows inside the SchemaRDD to Strings?
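For what it's worth, a minimal sketch of one way to do this in the SchemaRDD-era (Spark 1.x) API, where a Row exposes mkString over its column values; the delimiter is arbitrary:

```scala
import org.apache.spark.rdd.RDD

// schemaRDD: org.apache.spark.sql.SchemaRDD obtained elsewhere.
// Render each Row as one comma-separated String.
val strings: RDD[String] = schemaRDD.map(row => row.mkString(","))
```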
Storing shuffle files on Tachyon
Is it possible to store Spark shuffle files on Tachyon?
Creating a feature vector from text before using with MLLib
I'm trying to understand the intuition behind the featurize method that Aaron used in one of his demos. I believe this feature will just work for detecting the character set (i.e., the language used). Can someone help?

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hash each character bigram of the string into one of n buckets and build a
// normalized bigram-frequency vector, returned in sparse form.
def featurize(s: String): Vector = {
  val n = 1000
  val result = new Array[Double](n)
  val bigrams = s.sliding(2).toArray
  for (h <- bigrams.map(_.hashCode % n)) {
    result(h) += 1.0 / bigrams.length
  }
  Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
}
```
Setting serializer to KryoSerializer from command line for spark-shell
Hi, I want to set the serializer for my spark-shell to Kryo, i.e., set spark.serializer to org.apache.spark.serializer.KryoSerializer. Can I do it without setting a new SparkConf? Thanks -Soumya
Re: Spark as a Library
It depends on what you want to do with Spark. The following has worked for me: let the container handle the HTTP request and then talk to Spark using another HTTP/REST interface. You can use the Spark Job Server for this. Embedding Spark inside the container is not a great long-term solution IMO because you may see issues when you want to connect to a Spark cluster. On Tue, Sep 16, 2014 at 11:16 AM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: Hello, Suppose I want to use Spark from an application that I already submit to run in another container (e.g. Tomcat). Is this at all possible? Or do I have to split the app into two components, and submit one to Spark and one to the other container? In that case, what is the preferred way for the two components to communicate with each other? Thanks! Best, Oliver Oliver Ruebenacker | Solutions Architect, Altisource™ 290 Congress St, 7th Floor | Boston, Massachusetts 02210 P: (617) 728-5582 | ext: 275585 oliver.ruebenac...@altisource.com | www.Altisource.com
Re: About SpakSQL OR MLlib
Given the case class below:
1. Create a pair RDD of (age, Car) tuples (pairedRDD).
2. Create a new function fc that returns the interval's lower and upper bound.
3. Do a groupBy on the RDD from step 1, passing the function fc.

```scala
import java.util.Date

case class Car(id: String, age: Int, tkm: Int, emissions: Int, date: Date, km: Int, fuel: Int)

// returns the interval's lower and upper bound, e.g. fc(34, 5) == (30, 35)
def fc(x: Int, interval: Int): (Int, Int) = {
  val floor = x - (x % interval)
  val ceil = floor + interval
  (floor, ceil)
}

// group the (age, Car) pairs by the 5-year age interval
val myrdd = pairedRDD.groupBy { case (age, car) => fc(age, 5) }
```

On Mon, Sep 15, 2014 at 11:38 PM, boyingk...@163.com boyingk...@163.com wrote: Hi: I have a dataset with the structure [id, driverAge, TotalKiloMeter, Emissions, date, KiloMeter, fuel], and the data looks like this:

```
[1-980,34,221926,9,2005-2-8,123,14]
[1-981,49,271321,15,2005-2-8,181,82]
[1-982,36,189149,18,2005-2-8,162,51]
[1-983,51,232753,5,2005-2-8,106,92]
[1-984,56,45338,8,2005-2-8,156,98]
[1-985,45,132060,4,2005-2-8,179,98]
[1-986,40,15751,5,2005-2-8,149,77]
[1-987,36,167930,17,2005-2-8,121,87]
[1-988,53,44949,4,2005-2-8,195,72]
[1-989,34,252867,5,2005-2-8,181,86]
[1-990,53,152858,11,2005-2-8,130,43]
[1-991,40,126831,11,2005-2-8,126,47]
```

Now my requirement is to group by driverAge in steps of five, e.g. 20~25 is one group, 26~30 is another. How should I do it? Can someone give some code? -- boyingk...@163.com
Re: Spark and Scala
An RDD is a fault-tolerant distributed data structure. It is the primary abstraction in Spark. I would strongly suggest that you have a look at the following to get a basic idea:

http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
http://spark.apache.org/docs/latest/quick-start.html#basics
https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Take for example this: I have declared one queue *val queue = Queue.empty[Int]*, which is a pure Scala line in the program. I actually want the queue to be an RDD, but there are no direct methods to create an RDD which is a queue, right? What do you say about this? Does there exist something like *create an RDD which is a queue*? On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: No, Scala primitives remain primitives. Unless you create an RDD using one of the many methods, you would not be able to access any of the RDD methods. There is no automatic porting. Spark is an application as far as Scala is concerned - there is no special compilation (except, of course, the usual Scala and JIT compilation). On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I know that unpersist is a method on RDD. But my confusion is that, when we port our Scala programs to Spark, doesn't everything change to RDDs? On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: unpersist is a method on RDDs. RDDs are abstractions introduced by Spark. An Int is just a Scala Int. You can't call unpersist on an Int in Scala, and that doesn't change in Spark. On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: There is one thing that I am confused about. Spark has code that has been implemented in Scala. Now, can we run any Scala code on the Spark framework? What will be the difference in the execution of the Scala code on normal systems and on Spark? The reason for my question is the following: I had a variable *val temp = some operations* This temp was being created inside a loop, so as to manually throw it out of the cache; every time the loop ends I was calling *temp.unpersist()*, and this was returning an error saying that *value unpersist is not a method of Int*, which means that temp is an Int. Can someone explain to me why I was not able to call *unpersist* on *temp*? Thank You
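To make the point concrete, a minimal sketch: a Scala Queue is just a local collection on the driver, but since Queue extends Seq you can hand its contents to sc.parallelize to get an RDD; the RDD itself has no queue semantics:

```scala
import scala.collection.immutable.Queue

val queue = Queue(1, 2, 3, 4, 5)   // plain Scala collection on the driver
val rdd   = sc.parallelize(queue)  // RDD[Int]; distributed, but not a queue

// RDD methods work here; queue methods (enqueue/dequeue) do not.
println(rdd.count())               // 5
```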
Re: Running Spark shell on YARN
```
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 5: conf:
    at java.net.URI$Parser.fail(URI.java:2829)
    at java.net.URI$Parser.failExpecting(URI.java:2835)
    at java.net.URI$Parser.parse(URI.java:3038)
    at java.net.URI.<init>(URI.java:753)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
    ... 62 more
```

Spark context available as sc. On Fri, Aug 15, 2014 at 3:49 PM, Soumya Simanta soumya.sima...@gmail.com wrote: After changing the allocation I'm getting the following in my logs. No idea what this means.

```
14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED
14/08/15 15:44:34 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED
[... the same ACCEPTED report repeats once per second through 14/08/15 15:44:51 ...]
```

On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote: We generally recommend setting yarn.scheduler.maximum-allocation-mb to the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and it looks like I need to change this value. Should be upgraded
Running Spark shell on YARN
I've been using a standalone cluster all this time and it worked fine. Recently I've been using another Spark cluster that is based on YARN, and I have no experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm doing a very simple operation - reading a 100GB file from HDFS and running a count on it. This fails due to out-of-memory errors on the executors. Can someone point me to the command line parameters I should use for spark-shell so that this works? Thanks -Soumya
Re: Running Spark shell on YARN
I just checked the YARN config and it looks like I need to change this value. Should it be upgraded to 48G (the max memory allocated to YARN) per node?

```
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
  <source>java.io.BufferedInputStream@2e7e1ee</source>
</property>
```

On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, thanks for your response. When I try to do the following: ./spark-shell --executor-memory 46g --master yarn I get the following error:

```
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

After this I set the following env variable: export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/ The program launches but then halts with the following error: *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.* I guess this is some YARN setting that is not set correctly. Thanks -Soumya On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote: Hi Soumya, The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory: bin/spark-shell --executor-memory 46g --master yarn We leave a small cushion for the OS so we don't take up the entire system's memory. This option also applies to the standalone mode you've been using, but if you have been using the ec2 scripts, we set spark.executor.memory in conf/spark-defaults.conf for you automatically so you don't have to specify it each time on the command line. Of course, you can also do the same in YARN. -Andrew 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com: I've been using a standalone cluster all this time and it worked fine. Recently I've been using another Spark cluster that is based on YARN, and I have no experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm doing a very simple operation - reading a 100GB file from HDFS and running a count on it. This fails due to out-of-memory errors on the executors. Can someone point me to the command line parameters I should use for spark-shell so that this works? Thanks -Soumya
Re: Running Spark shell on YARN
After changing the allocation I'm getting the following in my logs. No idea what this means.

```
14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED
14/08/15 15:44:34 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1408131861372 yarnAppState: ACCEPTED
[... the same ACCEPTED report repeats once per second through 14/08/15 15:44:51 ...]
```

On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote: We generally recommend setting yarn.scheduler.maximum-allocation-mb to the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and it looks like I need to change this value. Should it be upgraded to 48G (the max memory allocated to YARN) per node?

```
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
  <source>java.io.BufferedInputStream@2e7e1ee</source>
</property>
```

On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, thanks for your response. When I try to do the following: ./spark-shell --executor-memory 46g --master yarn I get the following error:

```
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

After this I set the following env variable: export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/ The program launches but then halts with the following error: *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.* I guess this is some YARN setting that is not set correctly. Thanks -Soumya On Fri, Aug 15, 2014 at 2:19 PM, Andrew
Script to deploy spark to Google compute engine
Before I start doing something on my own I wanted to check if someone has created a script to deploy the latest version of Spark to Google Compute Engine. Thanks -Soumya
Re: Transform RDD[List]
Try something like this.

```scala
scala> val a = sc.parallelize(List(1,2,3,4,5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val b = sc.parallelize(List(6,7,8,9,10))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> val x = a zip b
x: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedRDD[3] at zip at <console>:16

scala> val f = x.map( x => List(x._1, x._2) )
f: org.apache.spark.rdd.RDD[List[Int]] = MappedRDD[5] at map at <console>:18

scala> f.foreach(println)
List(2, 7)
List(1, 6)
List(5, 10)
List(3, 8)
List(4, 9)
```

On Tue, Aug 12, 2014 at 12:42 AM, Kevin Jung itsjb.j...@samsung.com wrote: Hi, It may be a simple question, but I cannot figure out the most efficient way. There is an RDD containing lists:

```
RDD(
  List(1,2,3,4,5)
  List(6,7,8,9,10)
)
```

I want to transform this to:

```
RDD(
  List(1,6)
  List(2,7)
  List(3,8)
  List(4,9)
  List(5,10)
)
```

And I want to achieve this without using the collect method, because a real-world RDD can have a lot of elements and it may cause out-of-memory. Any ideas will be welcome. Best regards, Kevin
Re: Simple record matching using Spark SQL
Check your executor logs for the output, or if your data is not big, collect it in the driver and print it. On Jul 16, 2014, at 9:21 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi All, I'm trying to do a simple record matching between 2 files and wrote the following code:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.rdd.RDD

object SqlTest {
  // Note: the original post declared fld4 twice, which would not compile;
  // the second occurrence is renamed to fld4a here.
  case class Test(fld1: String, fld2: String, fld3: String, fld4: String,
                  fld4a: String, fld5: Double, fld6: String)

  sc.addJar("test1-0.1.jar")
  val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
  val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
  val sq = new SQLContext(sc)
  val file1_recs: RDD[Test] =
    file1.map(_.split(",")).map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
  val file2_recs: RDD[Test] =
    file2.map(_.split(",")).map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))
  val file1_schema = sq.createSchemaRDD(file1_recs)
  val file2_schema = sq.createSchemaRDD(file2_recs)
  file1_schema.registerAsTable("file1_tab")
  file2_schema.registerAsTable("file2_tab")
  val matched = sq.sql("select * from file1_tab l join file2_tab s on l.fld6=s.fld6 " +
    "where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and l.fld2=s.fld2")
  val count = matched.count()
  System.out.println("Found " + count + " matching records")
}
```

When I run this program on a standalone Spark cluster, it keeps running for a long time with no output or error. After waiting for a few minutes I'm forcibly killing it. But the same program works well when executed from a spark-shell. What is going wrong? What am I missing? ~Sarath
Re: Simple record matching using Spark SQL
When you submit your job, it should appear on the Spark UI, same as with the REPL. Make sure your job is submitted to the cluster properly. On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi Soumya, The data is very small, 500+ lines in each file. I removed the last 2 lines and placed this at the end: matched.collect().foreach(println). Still no luck. It's been more than 5 minutes and the execution is still running. I checked the logs; there is nothing in stdout, and in stderr I don't see anything going wrong, all are info messages. What else do I need to check? ~Sarath
Re: Simple record matching using Spark SQL
Can you try submitting a very simple job to the cluster? On Jul 16, 2014, at 10:25 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Yes, it is appearing on the Spark UI, and remains there with state RUNNING till I press Ctrl+C in the terminal to kill the execution. Barring the statements to create the Spark context, if I copy-paste the lines of my code into a spark-shell it runs perfectly, giving the desired output. ~Sarath
Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result
Please look at the following:

https://github.com/ooyala/spark-jobserver
http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
https://github.com/EsotericSoftware/kryo

You can train your model, convert it to PMML, and return that to your client, OR you can train your model, write the model (the serialized object) to the file system (local, HDFS, S3, etc.) or a datastore, and return its location to the client on a successful write. On Mon, Jul 14, 2014 at 4:27 PM, Aris Vlasakakis a...@vlasakakis.com wrote: Hello Spark community, I would like to write an application in Scala that is a model server. It should have an MLlib linear-regression model that is already trained on some big set of data, and is then able to call myLinearRegressionModel.predict() many times and return the result. Now, I want this client application to submit a job to Spark and tell the Spark cluster job to 1) train its particular MLlib model, which produces a LinearRegression model, and then 2) take the produced Scala org.apache.spark.mllib.regression.LinearRegressionModel *object*, serialize that object, and return this serialized object over the wire to my calling application. 3) My client application receives the serialized Scala (model) object, and can call .predict() on it over and over. I am separating the heavy lifting of training the model from doing model predictions; the client application will only do predictions using the MLlib model it received from the Spark application. The confusion I have is that I only know how to submit jobs to Spark by using the bin/spark-submit script, and then the only output I receive is stdout (as in, text). I want my Scala application to submit the Spark model-training programmatically, and I want the Spark application to return a SERIALIZED MLLIB OBJECT, not just some stdout text! How can I do this? I think my use case of separating long-running jobs to Spark and using its libraries in another application should be a pretty common design pattern. Thanks! -- Άρης Βλασακάκης Aris Vlasakakis
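As a concrete illustration of the second option, a minimal sketch that trains an MLlib model and writes it out with plain Java serialization (the output path is hypothetical; MLlib's 1.x regression models are Serializable):

```scala
import java.io.{FileOutputStream, ObjectOutputStream}
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// trainingData: RDD[LabeledPoint] prepared elsewhere
val model = LinearRegressionWithSGD.train(trainingData, numIterations = 100)

// Serialize the model object to a file; the client can later deserialize it
// with an ObjectInputStream and call model.predict(features) locally.
val out = new ObjectOutputStream(new FileOutputStream("/models/lr-model.bin"))
try out.writeObject(model) finally out.close()
```

The Spark job then only needs to return the path (or a datastore key) to the client, which keeps the heavy training on the cluster and the predictions in the client process.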
Re: How to separate a subset of an RDD by day?
If you are on the 1.0.0 release you can also try converting your RDD to a SchemaRDD and running a groupBy there. The SparkSQL optimizer may yield better results. It's worth a try at least. On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Solution 2 is to map the objects into a pair RDD where the key is the number of the day in the interval, then group by key, collect, and parallelize the resulting grouped data. However, I worry collecting large data sets is going to be a serious performance bottleneck. Why do you have to do a collect? You can do a groupBy and then write the grouped data back to disk (see the sketch below).
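A minimal sketch of that groupBy-and-write approach (the Event type and day computation are made up for illustration; `events` stands for the original RDD):

case class Event(timestampMs: Long, payload: String)

// Key each event by its epoch day, group, and write the grouped data back
// out instead of collecting it to the driver.
def dayOf(e: Event): Long = e.timestampMs / (24L * 60 * 60 * 1000)
val grouped = events.map(e => (dayOf(e), e)).groupByKey()
grouped.map { case (day, es) => day + "\t" + es.mkString(",") }
  .saveAsTextFile("hdfs://namenode/events-by-day")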
Re: How to separate a subset of an RDD by day?
I think my best option is to partition my data in directories by day before running my Spark application, and then direct my Spark application to load RDDs from each directory when I want to load a date range. How does this sound? If your upstream system can write data by day then it makes perfect sense to do that and load (into Spark) only the data that is required for processing. This also saves you the filter step and, hopefully, time and memory. If you want the bigger dataset back you can always combine multiple days of data (RDDs) with a union (see the sketch below).
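A sketch of loading such a date range (the path layout "/data/dt=YYYY-MM-DD" is a made-up convention for illustration):

// Union only the per-day directories in the requested range into one RDD.
val days = Seq("2014-07-01", "2014-07-02", "2014-07-03")
val range = days.map(d => sc.textFile("hdfs://namenode/data/dt=" + d)).reduce(_ union _)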
Re: Streaming training@ Spark Summit 2014
Do you have a proxy server? If yes, you need to set the proxy for twitter4j. On Jul 11, 2014, at 7:06 PM, SK skrishna...@gmail.com wrote: I don't get any exceptions or error messages. I tried it both with and without VPN and had the same outcome. But I can try again without VPN later today and report back. thanks.
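If a proxy does turn out to be involved, twitter4j takes its proxy settings through the same ConfigurationBuilder used for the OAuth credentials; a minimal sketch (host and port are placeholders):

val config = new twitter4j.conf.ConfigurationBuilder()
  .setHttpProxyHost("proxy.example.com") // placeholder
  .setHttpProxyPort(8080)                // placeholder
  // ... the usual OAuth keys and secrets go here ...
  .build()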
Re: Streaming training@ Spark Summit 2014
Try running a simple standalone program if you are using Scala and see if you are getting any data. I use this to debug any connection/twitter4j issues.

import twitter4j._

// put your keys and creds here
object Util {
  val config = new twitter4j.conf.ConfigurationBuilder()
    .setOAuthConsumerKey("")
    .setOAuthConsumerSecret("")
    .setOAuthAccessToken("")
    .setOAuthAccessTokenSecret("")
    .build
}

/**
 * Add this to your build.sbt:
 * "org.twitter4j" % "twitter4j-stream" % "3.0.3",
 */
object SimpleStreamer extends App {

  def simpleStatusListener = new StatusListener() {
    def onStatus(status: Status) { println(status.getUserMentionEntities.length) }
    def onDeletionNotice(statusDeletionNotice: StatusDeletionNotice) {}
    def onTrackLimitationNotice(numberOfLimitedStatuses: Int) {}
    def onException(ex: Exception) { ex.printStackTrace }
    def onScrubGeo(arg0: Long, arg1: Long) {}
    def onStallWarning(warning: StallWarning) {}
  }

  val keywords = List("dog", "cat")
  val twitterStream = new TwitterStreamFactory(Util.config).getInstance
  twitterStream.addListener(simpleStatusListener)
  twitterStream.filter(new FilterQuery().track(keywords.toArray))
}

On Fri, Jul 11, 2014 at 7:19 PM, SK skrishna...@gmail.com wrote: I don't have a proxy server. thanks.
Re: Comparative study
Daniel, do you mind sharing the size of your cluster and the production data volumes? Thanks, Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell, and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala, Spark vs MapReduce. Is it worth migrating from an existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
Re: Spark Summit 2014 Day 2 Video Streams?
Are these sessions recorded? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote: General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014 Track A: http://www.ustream.tv/channel/track-a1 Track B: http://www.ustream.tv/channel/track-b1 Track C: http://www.ustream.tv/channel/track-c1 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com wrote: I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. help! -- Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
Re: Spark Summit 2014 Day 2 Video Streams?
Awesome. Just want to catch up on some sessions from other tracks. Learned a ton over the last two days. Thanks Soumya On Jul 1, 2014, at 8:50 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, we’re going to try to get the videos up as soon as possible. Matei On Jul 1, 2014, at 7:47 PM, Marco Shaw marco.s...@gmail.com wrote: They are recorded... For example, 2013: http://spark-summit.org/2013 I'm assuming the 2014 videos will be up in 1-2 weeks. Marco On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Are these sessions recorded ? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote: General Session / Keynotes : http://www.ustream.tv/channel/spark-summit-2014 Track A : http://www.ustream.tv/channel/track-a1 Track B: http://www.ustream.tv/channel/track-b1 Track C: http://www.ustream.tv/channel/track-c1 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com wrote: I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. help! -- Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
Re: Spark streaming and rate limit
You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate to Spark. On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number of simultaneous calls to that service (to avoid DoS) using Spark Streaming? If so, limiting the rate implies possible buffer growth... how can I control the buffer of incoming events waiting to be processed? Best, Flavio
Re: Spark streaming and rate limit
Flavio - I'm new to Spark as well, but I've done stream processing using other frameworks. My comments below are not spark-streaming specific; maybe someone who knows more can provide better insights. I read your post on my phone and I believe my answer doesn't completely address the issue you have raised. Do you need to call the external service for every event, i.e., do you need to process all events? Also, does the order of processing events matter? Is there a time bound in which each event should be processed? Calling an external service means network IO, so you have to buffer events if your service is rate limited or slower than the rate at which you produce events. Here are some ways of dealing with this situation: 1. Drop events based on a policy (such as buffer/queue size). 2. Tell the event producer to slow down, if that's in your control. 3. Use a proxy or a set of proxies to distribute the calls to the remote service, if the rate limit is by user or network node only. I'm not sure how many of these are implemented directly in Spark Streaming, but you can have an external component that controls the event rate and only sends events to Spark Streaming when it's ready to process more messages. Hope this helps. -Soumya On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Thanks for the quick reply Soumya. Unfortunately I'm a newbie with Spark... what do you mean? Is there any reference on how to do that? On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta soumya.sima...@gmail.com wrote: You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate to Spark. On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number of simultaneous calls to that service (to avoid DoS) using Spark Streaming? If so, limiting the rate implies possible buffer growth... how can I control the buffer of incoming events waiting to be processed? Best, Flavio
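As an illustration of the buffering idea, a framework-agnostic sketch (not Spark Streaming API): a bounded blocking queue between the producer and whatever feeds Spark gives you back-pressure for free, because put() blocks once the buffer is full instead of letting memory grow without bound.

import java.util.concurrent.ArrayBlockingQueue

object BoundedBuffer {
  // Capacity is the tuning knob: the maximum number of in-flight events.
  private val buffer = new ArrayBlockingQueue[String](10000)

  // Producer side: blocks when full, which slows the upstream source down.
  // (Swap put for offer if dropping events is preferable to blocking.)
  def onEvent(event: String): Unit = buffer.put(event)

  // Consumer side: drain at whatever rate the external service allows.
  def nextEvent(): String = buffer.take()
}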
Re: Unable to run a Standalone job
Try cleaning your maven (.m2) and ivy cache. On May 23, 2014, at 12:03 AM, Shrikar archak shrika...@gmail.com wrote: Yes, I did a sbt publish-local. Ok, I will try with Spark 0.9.1. Thanks, Shrikar On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you publish Spark locally, which allowed you to use it as a dependency? This is weird indeed. SBT should take care of all the dependencies of Spark. In any case, you can try the last released Spark 0.9.1 and see if the problem persists. On Thu, May 22, 2014 at 3:59 PM, Shrikar archak shrika...@gmail.com wrote: I am running it as sbt run, locally. Thanks, Shrikar On Thu, May 22, 2014 at 3:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you launching the application? sbt run? spark-submit? Local mode or Spark standalone cluster? Are you packaging all your code into a jar? It looks to me like you have Spark classes in your execution environment but are missing some of Spark's dependencies. TD On Thu, May 22, 2014 at 2:27 PM, Shrikar archak shrika...@gmail.com wrote: Hi All, I am trying to run the network count example as a separate standalone job and am running into some issues. Environment: 1) Mac Mavericks 2) Latest spark repo from Github. I have a structure like this:

Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/NetworkWordCount.scala
./src/main/scala/SimpleApp.scala.bk

simple.sbt:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-streaming" % "1.0.0-SNAPSHOT"
)

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

I am able to run the SimpleApp which is mentioned in the doc, but when I try to run the NetworkWordCount app I get an error like this. Am I missing something?

[info] Running com.shrikar.sparkapps.NetworkWordCount
14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to: shrikar
14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(shrikar)
14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/05/22 14:26:48 INFO Remoting: Starting remoting
14/05/22 14:26:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.10.88:49963]
14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.10.88:49963]
14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local directory at /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with capacity 911.6 MB.
14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port 49964 with id = ConnectionManagerId(192.168.10.88,49964)
14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block manager 192.168.10.88:49964 with 911.6 MB RAM
14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered BlockManager
14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
[error] (run-main) java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
at org.apache.spark.HttpServer.start(HttpServer.scala:54)
at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.init(SparkContext.scala:202)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:91)
at com.shrikar.sparkapps.NetworkWordCount$.main(NetworkWordCount.scala:39)
at com.shrikar.sparkapps.NetworkWordCount.main(NetworkWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
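If cleaning the caches doesn't resolve it, the other suggestion in the thread - pinning to the released 0.9.1 instead of the locally published snapshot - is a one-line change per dependency. A sketch of the adjusted simple.sbt (version strings assume the artifacts published to Maven Central at the time):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "0.9.1",
  "org.apache.spark" %% "spark-streaming" % "0.9.1"
)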
Re: Run Apache Spark on Mini Cluster
Suggestion - try to get an idea of your hardware requirements by running a sample on Amazon's EC2 or Google Compute Engine. It's relatively easy (and cheap) to get started on the cloud before you invest in your own hardware, IMO. On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar upent...@gmail.com wrote: Hi, I would like to set up an Apache Spark platform on a mini cluster. Is there any recommendation for the hardware that I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks, Upender
Re: Historical Data as Stream
@Laeeq - please see this example. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49 On Sat, May 17, 2014 at 2:06 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: @Soumya Simanta Right now it's just a proof of concept. Later I will have a real stream. It's EEG files of the brain, so it can later be used for real-time analysis of EEG streams. @Mayur The size is huge, yes, so it's better to do it in a distributed manner, and as I said above I want to read it as a stream because later I will have stream data. This is a proof of concept. Regards, Laeeq On Saturday, May 17, 2014 7:03 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: The real question is why you are looking to consume the file as a stream: 1. Too big to load as an RDD. 2. Operate in a sequential manner. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 17, 2014 at 5:12 AM, Soumya Simanta soumya.sima...@gmail.com wrote: A file is just a stream with a fixed length. Usually streams don't end, but in this case it would. On the other hand, if you read your file as a stream you may not be able to use the entire data in the file for your analysis. Spark (given enough memory) can process large amounts of data quickly. On May 15, 2014, at 9:52 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, I have data in a file. Can I read it as a stream in Spark? I know it seems odd to read a file as a stream, but it has practical applications in real life if I can read it as a stream. Are there any other tools which can give this file as a stream to Spark, or do I have to make batches manually, which is not what I want? It's a column of a million values. Regards, Laeeq
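For reference, the heart of that example is only a few lines; a sketch (the batch interval and HDFS directory below are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Any file atomically moved into the monitored directory becomes part of
// the next batch, which is how static files can be replayed as a stream.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("hdfs://namenode/eeg/incoming")
lines.count().print()
ssc.start()
ssc.awaitTermination()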
Re: Proper way to create standalone app with custom Spark version
Install your custom Spark jar to your local maven or ivy repo, then use this custom jar in your pom/sbt file. On May 15, 2014, at 3:28 AM, Andrei faithlessfri...@gmail.com wrote: (Sorry if you have already seen this message - it seems like there were some issues delivering messages to the list yesterday) We can create a standalone Spark application by simply adding spark-core_2.x to build.sbt/pom.xml and connecting it to a Spark master. We can also build a custom version of Spark (e.g. compiled against Hadoop 2.x) from source and deploy it to the cluster manually. But what is the proper way to use a _custom version_ of Spark in a _standalone application_? I'm currently trying to deploy the custom version to a local Maven repository and add it to the SBT project. Another option is to add Spark as a local jar to every project. But both of these ways look overcomplicated and in general wrong. So what is the implied way to do it? Thanks, Andrei
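Concretely, that amounts to running sbt publish-local (or mvn install) from the custom Spark checkout, then referencing whatever version string that build declares. A sketch of the build.sbt side (the version below is hypothetical - use your build's own):

// Resolved from the local ivy repository populated by `sbt publish-local`
// in the custom Spark source tree.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-custom-SNAPSHOT"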
Re: Historical Data as Stream
A file is just a stream with a fixed length. Usually streams don't end, but in this case it would. On the other hand, if you read your file as a stream you may not be able to use the entire data in the file for your analysis. Spark (given enough memory) can process large amounts of data quickly. On May 15, 2014, at 9:52 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, I have data in a file. Can I read it as a stream in Spark? I know it seems odd to read a file as a stream, but it has practical applications in real life if I can read it as a stream. Are there any other tools which can give this file as a stream to Spark, or do I have to make batches manually, which is not what I want? It's a column of a million values. Regards, Laeeq
Is there a way to load a large file from HDFS faster into Spark
I've a Spark cluster with 3 worker nodes. Workers: 3. Cores: 48 total, 48 used. Memory: 469.8 GB total, 72.0 GB used. I want to process a single compressed file (*.gz) on HDFS. The file is 1.5 GB compressed and 11 GB uncompressed. When I try to read the compressed file from HDFS it takes a while (4-5 minutes) to load it into an RDD. If I use the .cache operation it takes even longer. Is there a way to make loading of the RDD from HDFS faster? Thanks -Soumya
Re: Fwd: Is there a way to load a large file from HDFS faster into Spark
Yep, I figured that out. I uncompressed the file and it is much faster now. Thanks. On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: .gz files are not splittable, hence harder to process. The easiest fix is to move to a splittable compression like LZO and break the file into multiple blocks to be read and for subsequent processing. On 11 May 2014 09:01, Soumya Simanta soumya.sima...@gmail.com wrote: I've a Spark cluster with 3 worker nodes. Workers: 3. Cores: 48 total, 48 used. Memory: 469.8 GB total, 72.0 GB used. I want to process a single compressed file (*.gz) on HDFS. The file is 1.5 GB compressed and 11 GB uncompressed. When I try to read the compressed file from HDFS it takes a while (4-5 minutes) to load it into an RDD. If I use the .cache operation it takes even longer. Is there a way to make loading of the RDD from HDFS faster? Thanks -Soumya
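If the file has to stay gzipped, one common mitigation (a sketch, not benchmarked here) is to accept the single-partition read and repartition immediately so everything downstream uses all cores:

// A .gz file is read as one partition by one task; repartition spreads the
// rows across the cluster before caching and further processing.
val rdd = sc.textFile("hdfs://namenode/data/big-file.gz")
  .repartition(48) // roughly the cluster's total core count
  .cache()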
Re: How to use spark-submit
Will sbt-pack and the maven solution work for the Scala REPL? I need the REPL because it saves a lot of time when I'm playing with large data sets: I load them once, cache them, and then try out things interactively before putting them in a standalone driver. I've got sbt working for my own driver program on Spark 0.9. On May 11, 2014, at 3:49 PM, Stephen Boesch java...@gmail.com wrote: Just discovered sbt-pack: that addresses (quite well) the last item, identifying and packaging the external jars. 2014-05-11 12:34 GMT-07:00 Stephen Boesch java...@gmail.com: Hi Sonal, yes I am working towards that same idea. How did you go about creating the non-spark-jar dependencies? The way I am doing it is a separate straw-man project that does not include Spark but has the external third-party jars included, then running sbt compile:managedClasspath and reverse-engineering the lib jars from it. That is obviously not ideal. The maven run will be useful for other projects built by maven: I will keep it in my notes. As far as sbt run-example goes, it requires additional libraries to be added for my external dependencies. I tried several things, including ADD_JARS, --driver-class-path and combinations of extraClassPath. I have deferred that ad-hoc approach until I find a systematic one. 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: I am creating a jar with only my dependencies and running spark-submit through my project's mvn build. I have configured the mvn exec goal to point to the location of the script. Here is how I have set it up for my app. The mainClass is my driver program, and I am able to send my custom args too. Hope this helps.

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>exec</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <executable>/home/sgoyal/spark/bin/spark-submit</executable>
    <arguments>
      <argument>${jars}</argument>
      <argument>--class</argument>
      <argument>${mainClass}</argument>
      <argument>--arg</argument>
      <argument>${spark.master}</argument>
      <argument>--arg</argument>
      <argument>${my app arg 1}</argument>
      <argument>--arg</argument>
      <argument>${my arg 2}</argument>
    </arguments>
  </configuration>
</plugin>

Best Regards, Sonal Nube Technologies On Wed, May 7, 2014 at 6:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0? TD On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote: I have a spark streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what are the --driver-class-path and/or -Dspark.executor.extraClassPath parameters required? For reference, the following error is proving difficult to resolve: java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
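For reference, a plain spark-submit invocation covering the pieces discussed in this thread (class name, master URL, and jar paths are placeholders):

./bin/spark-submit \
  --class com.example.MyStreamingApp \
  --master spark://master-host:7077 \
  --jars lib/joda-time-2.3.jar,lib/twitter4j-core-3.0.5.jar \
  target/myapp-1.0.jar arg1 arg2

The --jars list is shipped to the executors, which is what addresses the ClassNotFoundError pattern above; --driver-class-path only affects the driver JVM.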
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
I just upgraded my Spark version to 1.0.0_SNAPSHOT. commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3 Author: witgo wi...@qq.com Date: Fri May 2 12:40:27 2014 -0700 I'm running a standalone cluster with 3 workers. Workers: 3. Cores: 48 total, 0 used. Memory: 469.8 GB total, 0.0 B used. However, when I try to run bin/spark-shell I get the following error after some time, even if I don't perform any operations in the Spark shell.

14/05/05 10:20:52 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:256)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:54)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:679)
at java.lang.UNIXProcess$1.run(UNIXProcess.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at java.lang.UNIXProcess.init(UNIXProcess.java:119)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:470)
at java.lang.Runtime.exec(Runtime.java:612)
at java.lang.Runtime.exec(Runtime.java:485)
at scala.tools.jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:178)
at scala.tools.jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:168)
at scala.tools.jline.internal.TerminalLineSettings.stty(TerminalLineSettings.java:163)
at scala.tools.jline.internal.TerminalLineSettings.get(TerminalLineSettings.java:67)
at scala.tools.jline.internal.TerminalLineSettings.getProperty(TerminalLineSettings.java:87)
at scala.tools.jline.UnixTerminal.readVirtualKey(UnixTerminal.java:127)
at scala.tools.jline.console.ConsoleReader.readVirtualKey(ConsoleReader.java:933)
at org.apache.spark.repl.SparkJLineReader$JLineConsoleReader.readOneKey(SparkJLineReader.scala:54)
at org.apache.spark.repl.SparkJLineReader.readOneKey(SparkJLineReader.scala:81)
at scala.tools.nsc.interpreter.InteractiveReader$class.readYesOrNo(InteractiveReader.scala:29)
at org.apache.spark.repl.SparkJLineReader.readYesOrNo(SparkJLineReader.scala:25)
at org.apache.spark.repl.SparkILoop$$anonfun$1.org$apache$spark$repl$SparkILoop$$anonfun$$fn$1(SparkILoop.scala:576)
at org.apache.spark.repl.SparkILoop$$anonfun$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$fn$1$1.apply$mcZ$sp(SparkILoop.scala:576)
at scala.tools.nsc.interpreter.InteractiveReader$class.readYesOrNo(InteractiveReader.scala:32)
at org.apache.spark.repl.SparkJLineReader.readYesOrNo(SparkJLineReader.scala:25)
at org.apache.spark.repl.SparkILoop$$anonfun$1.org$apache$spark$repl$SparkILoop$$anonfun$$fn$1(SparkILoop.scala:576)
at org.apache.spark.repl.SparkILoop$$anonfun$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$fn$1$1.apply$mcZ$sp(SparkILoop.scala:576)
at scala.tools.nsc.interpreter.InteractiveReader$class.readYesOrNo(InteractiveReader.scala:32)
at org.apache.spark.repl.SparkJLineReader.readYesOrNo(SparkJLineReader.scala:25)
at org.apache.spark.repl.SparkILoop$$anonfun$1.org$apache$spark$repl$SparkILoop$$anonfun$$fn$1(SparkILoop.scala:576)
at org.apache.spark.repl.SparkILoop$$anonfun$1.applyOrElse(SparkILoop.scala:579)
at org.apache.spark.repl.SparkILoop$$anonfun$1.applyOrElse(SparkILoop.scala:566)
at scala.runtime.AbstractPartialFunction$mcZL$sp.apply$mcZL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:983)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
... 7 more
Problem with sharing class across worker nodes using spark-shell on Spark 1.0.0
Hi, I'm trying to run a simple Spark job that uses a 3rd-party class (in this case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT. I'm starting my bin/spark-shell with the following command:

./spark-shell --driver-class-path $LIBPATH/jodatime2.3/joda-convert-1.2.jar:$LIBPATH/jodatime2.3/joda-time-2.3/joda-time-2.3.jar:$LIBPATH/twitter4j-core-3.0.5.jar --jars $LIBPATH/jodatime2.3/joda-convert-1.2.jar,$LIBPATH/jodatime2.3/joda-time-2.3/joda-time-2.3.jar,$LIBPATH/twitter4j-core-3.0.5.jar

My code was working fine in 0.9.1 when I used the following options pointing to the same jars as above: export SPARK_CLASSPATH, export ADD_JAR. Now I'm getting a NoClassDefFoundError on each of my worker nodes: 14/05/05 14:03:30 INFO TaskSetManager: Loss was due to java.lang.NoClassDefFoundError: twitter4j/Status [duplicate 40] 14/05/05 14:03:30 INFO TaskSetManager: Starting task 0.0:26 as TID 73 on executor 2: worker1.xxx.. (NODE_LOCAL) What am I missing here? Thanks -Soumya
Re: How to use spark-submit
Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote: I have a spark streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what are the --driver-class-path and/or -Dspark.executor.extraClassPath parameters required? For reference, the following error is proving difficult to resolve: java.lang.ClassNotFoundException: org.apache.spark.streaming.examples.StreamingExamples
Re: Announcing Spark SQL
Very nice. Any plans to make the SQL typesafe using something like Slick (http://slick.typesafe.com/)? Thanks! On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote: Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to a new feature we are pretty excited about for Spark 1.0. http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Michael
Help with building and running examples with GraphX from the REPL
I'm not able to run the GraphX examples from the Scala REPL. Can anyone point me to the correct documentation that covers the configuration and/or how to build GraphX for the REPL? Thanks
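For what it's worth, since Spark 0.9 GraphX has shipped inside the main Spark assembly, so no special build should be needed for the REPL; a minimal sketch to try in bin/spark-shell:

import org.apache.spark.graphx._

// A tiny two-vertex, one-edge graph built from in-memory collections.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph = Graph(vertices, edges)
println(graph.vertices.count + " vertices, " + graph.edges.count + " edges")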