Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread M. Dale
Try spark.yarn.user.classpath.first (see https://issues.apache.org/jira/browse/SPARK-2996 - only works for YARN). Also see the thread at http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html. HTH, Markus On 02/03/2015 11:20 PM, Corey Nolet wrote:
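
For reference, a minimal sketch of setting this flag programmatically (the app name is a placeholder); it can equally be passed as --conf spark.yarn.user.classpath.first=true to spark-submit:

    import org.apache.spark.SparkConf

    // YARN-only setting from SPARK-2996: prefer the user's jars over
    // the Spark/Hadoop jars on the executor classpath.
    val conf = new SparkConf()
      .setAppName("guava-conflict-app") // hypothetical app name
      .set("spark.yarn.user.classpath.first", "true")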

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread bo yang
Corey, Which version of Spark do you use? I am using Spark 1.2.0 and Guava 15.0, and it seems fine. Best, Bo On Tue, Feb 3, 2015 at 8:56 PM, M. Dale medal...@yahoo.com.invalid wrote: Try spark.yarn.user.classpath.first (see https://issues.apache.org/jira/browse/SPARK-2996 - only works for

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now with Guava versions between my Spark application on YARN and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0), while it appears the Spark executors that are working on my

HiveContext in SparkSQL - concurrency issues

2015-02-03 Thread matha.harika
Hi, I've been trying to use HiveContext (instead of SQLContext) in my SparkSQL application, and when I run the application concurrently, only the first call works; every other call throws the following error: ERROR Datastore.Schema: Failed initialising database. Failed to start

Multiple running SparkContexts detected in the same JVM!

2015-02-03 Thread gavin zhang
I have a cluster running CDH 5.1.0 with the Spark component. Because the default version of Spark in CDH 5.1.0 is 1.0.0 and I want to use some features of Spark 1.2.0, I compiled another Spark with Maven. But when I ran spark-shell and created a new SparkContext, I met the error below:

Re: StackOverflowError on RDD.union

2015-02-03 Thread Mark Hamstra
Use SparkContext#union[T](rdds: Seq[RDD[T]]) On Tue, Feb 3, 2015 at 7:43 PM, Thomas Kwan thomas.k...@manage.com wrote: I am trying to combine multiple RDDs into 1 RDD, and I am using the union function. I wonder if anyone has seen StackOverflowError as follows: Exception in thread "main"
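
For reference, a minimal sketch of the difference (the input RDDs here are arbitrary placeholders):

    import org.apache.spark.rdd.RDD

    // Hypothetical inputs: many small RDDs to combine.
    val rdds: Seq[RDD[Int]] = (1 to 1000).map(i => sc.parallelize(Seq(i)))

    // Chained unions nest one UnionRDD per call, so the lineage grows deep
    // enough to overflow the stack:
    //   val chained = rdds.reduce(_ union _)

    // SparkContext.union builds a single flat UnionRDD over all inputs:
    val combined = sc.union(rdds)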

Re: connector for CouchDB

2015-02-03 Thread hnahak
Spark doesn't support it out of the box, but this connector is open source; you can get it from GitHub. Which of the two DBs is better depends on what type of solution you are looking for. Please refer to this link: http://blog.nahurst.com/visual-guide-to-nosql-systems FYI, from the list of NoSQL in

StackOverflowError on RDD.union

2015-02-03 Thread Thomas Kwan
I am trying to combine multiple RDDs into 1 RDD, and I am using the union function. I wonder if anyone has seen a StackOverflowError as follows: Exception in thread "main" java.lang.StackOverflowError at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread McNerlin, Andrew (Agoda)
Hi Sean, I'm interested in trying something similar. How was your performance when you had many concurrent queries running against spark? I know this will work well where you have a low volume of queries against a large dataset, but am concerned about having a high volume of queries against

Re: Fail to launch spark-shell on Windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun, I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely for development purposes). I had most recently installed them utilizing Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy thread concerning the null\bin\winutils issue is addressed

Spark SQL taking long time to print records from a table

2015-02-03 Thread jguliani
I have 3 text files in HDFS which I am reading using Spark SQL and registering them as tables. After that I am doing almost 5-6 operations, including joins, group by, etc. And this whole process takes hardly 6-7 secs. (Source file size - 3 GB with almost 20 million rows.) As a final step of

Re: How to define a file filter for file name patterns in Apache Spark Streaming in Java?

2015-02-03 Thread Emre Sevinc
Hello Akhil, Thank you for taking your time for a detailed answer. I managed to solve it in a very similar manner. Kind regards, Emre Sevinç On Mon, Feb 2, 2015 at 8:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Emre, This is how you do that in scala: val lines =
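
The quoted snippet is cut off above; a sketch along the same lines, filtering by file name pattern (ssc is an existing StreamingContext, and the directory and .json extension are placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Only pick up files whose names match the pattern; others are skipped.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///input/dir",
        (path: Path) => path.getName.endsWith(".json"),
        newFilesOnly = true)
      .map(_._2.toString)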

Re: Spark Master Build Failing to run on cluster in standalone - ClassNotFoundException: javax.servlet.FilterRegistration

2015-02-03 Thread Sean Owen
Already come up several times today: https://issues.apache.org/jira/browse/SPARK-5557 On Tue, Feb 3, 2015 at 8:04 AM, Night Wolf nightwolf...@gmail.com wrote: Hi, I just built Spark 1.3 master using maven via make-distribution.sh; ./make-distribution.sh --name mapr3 --skip-java-test --tgz

Is LogisticRegressionWithSGD in MLlib scalable?

2015-02-03 Thread Peng Zhang
Hi Everyone, Is LogisticRegressionWithSGD in MLlib scalable? If so, what is the idea behind the scalable implementation? Thanks in advance, Peng - Peng Zhang

Unable to run spark-shell after build

2015-02-03 Thread Jaonary Rabarisoa
Hi all, I'm trying to run the master version of Spark in order to test some alpha components in the ml package. I followed the Building Spark documentation and built it with: $ mvn clean package The build is successful but when I try to run spark-shell I get the following error: *Exception in

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Anyone has implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

connecting spark with ActiveMQ

2015-02-03 Thread Mohit Durgapal
Hi All, I have a requirement where I need to consume messages from ActiveMQ and do live stream processing as well as batch processing using Spark. Is there a spark-plugin or library that can enable this? If not, then do you know any other way this could be done? Regards Mohit

RE: Sort based shuffle not working properly?

2015-02-03 Thread Mohammed Guller
Nitin, Suing Spark is not going to help. Perhaps you should sue someone else :-) Just kidding! Mohammed -Original Message- From: nitinkak001 [mailto:nitinkak...@gmail.com] Sent: Tuesday, February 3, 2015 1:57 PM To: user@spark.apache.org Subject: Re: Sort based shuffle not working

Re: Sort based shuffle not working properly?

2015-02-03 Thread Sean Owen
Hm, I don't think the sort partitioner is going to cause the result to be ordered by c1,c2 if you only partitioned on c1. I mean, it's not even guaranteed that the type of c2 has an ordering, right? On Tue, Feb 3, 2015 at 3:38 PM, nitinkak001 nitinkak...@gmail.com wrote: I am trying to implement

RE: Fail to launch spark-shell on Windows 2008 R2

2015-02-03 Thread Wang, Ningjun (LNG-NPV)
Hi Gen, Thanks for your feedback. We do have a business reason to run Spark on Windows. We have an existing application that is built on C# .NET running on Windows. We are considering adding Spark to the application for parallel processing of large data. We want Spark to run on Windows so it

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
You could also just push the data to Amazon S3, which would un-link the size of the cluster needed to process the data from the size of the data. DR On 02/03/2015 11:43 AM, Joe Wass wrote: I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS

Spark (SQL) as OLAP engine

2015-02-03 Thread Adamantios Corais
Hi, After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine. My goal is to push aggregated data (to Cassandra or other low-latency data storage) and then be able to project the results on a web page (web service). New data will be added (aggregated) once a

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-02-03 Thread maven
The version I'm using was already pre-built for Hadoop 2.3.

ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS somehow. I currently have a cluster of 5 x m3.xlarge, each of which has 80GB disk. Each HDFS node reports 73 GB, and the total capacity is ~370 GB. If I want to process 800 GB of data (assuming

Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread Adamantios Corais
Hi, I am using Spark 0.9.1 and I am looking for a proper viz tool that supports that specific version. As far as I have seen, all relevant tools (e.g. spark-notebook, zeppelin-project etc.) only support 1.1 or 1.2; no mention of older versions of Spark. Any ideas or suggestions?

Re: Writing RDD to a csv file

2015-02-03 Thread Gerard Maas
this is more of a scala question, so probably next time you'd like to address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala val optArrStr: Option[Array[String]] = ??? optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or whatever default value you have for this.

Writing RDD to a csv file

2015-02-03 Thread kundan kumar
I have an RDD which is of type org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] I want to write it as a csv file. Please suggest how this can be done. myrdd.map(line => (line._1 + "," + line._2._1.mkString(",") + "," + line._2._2.mkString(","))).saveAsTextFile("hdfs://...")
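
A minimal sketch combining this attempt with Gerard's suggestion above, defaulting the Option to an empty string (the output path is a placeholder):

    // One comma-separated line per record; an absent Option becomes an
    // empty trailing field instead of a mangled Array.toString.
    myrdd.map { case (key, (arr, optArr)) =>
      key + "," + arr.mkString(",") + "," + optArr.map(_.mkString(",")).getOrElse("")
    }.saveAsTextFile("hdfs:///output/csv")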

Re: Writing RDD to a csv file

2015-02-03 Thread kundan kumar
Thanks Gerard !! This is working. On Tue, Feb 3, 2015 at 6:44 PM, Gerard Maas gerard.m...@gmail.com wrote: this is more of a scala question, so probably next time you'd like to address a Scala forum eg. http://stackoverflow.com/questions/tagged/scala val optArrStr:Option[Array[String]] =

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Hello Adamantios, Thanks for the poke and the interest. Actually, you're the second person asking about backporting it. Yesterday (late), I created a branch for it... and the simple local Spark test worked! \o/. However, it'll be the 'old' UI :-/. Since I haven't ported the code using Play 2.2.6 to the

Spark Master Build Failing to run on cluster in standalone - ClassNotFoundException: javax.servlet.FilterRegistration

2015-02-03 Thread Night Wolf
Hi, I just built Spark 1.3 master using maven via make-distribution.sh; ./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive -Phive-thriftserver -Phive-0.12.0 When trying to start the standalone spark master on a cluster I get the following stack trace; 15/02/04 08:53:56

Re: Unable to run spark-shell after build

2015-02-03 Thread Sean Owen
Yes, I see this too. I think the Jetty shading still needs a tweak. It's not finding the servlet API classes. Let's converge on SPARK-5557 to discuss. On Tue, Feb 3, 2015 at 2:04 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to run the master version of spark in order to

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-03 Thread Jay Hutfles
I think this is a separate issue with how the EdgeRDDImpl partitions edges. If you can merge this change in and rebuild, it should work: https://github.com/apache/spark/pull/4136/files If you can't, I just called the Graph.partitionBy() method right after constructing my graph but before

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
The data is coming from S3 in the first place, and the results will be uploaded back there. But even in the same availability zone, fetching 170 GB (that's gzipped) is slow. From what I understand of the pipelines, multiple transforms on the same RDD might involve re-reading the input, which very

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
We use S3 as the main storage for all our input data and our generated (output) data. (Tens of terabytes of data daily.) We read gzipped data directly from S3 in our Hadoop/Spark jobs - it's not crazily slow, as long as you parallelize the work well by distributing the processing across enough

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Ted Yu
Using the s3a protocol (introduced in Hadoop 2.6.0) would be faster compared to s3. The upcoming Hadoop 2.7.0 contains some bug fixes for s3a. FYI On Tue, Feb 3, 2015 at 9:48 AM, David Rosenstrauch dar...@darose.net wrote: We use S3 as a main storage for all our input data and our generated
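
On the Spark side, the change is just the URL scheme; a sketch (the bucket, paths, and credentials are placeholders, and the s3a filesystem classes must be on the classpath):

    // Credentials, if not supplied via IAM roles or environment variables.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    val input = sc.textFile("s3a://my-bucket/input/*.gz") // s3a:// instead of s3:// or s3n://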

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-03 Thread Manoj Samel
Hi, Any thoughts? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 SchemaRDD has schema with decimal columns created like x1 = new StructField("a", DecimalType(14,4), true) x2 = new StructField("b", DecimalType(14,4), true) Registering as SQL

Kryo serialization and OOM

2015-02-03 Thread Joe Wass
I have about 500 MB of data and I'm trying to process it on a single `local` instance. I'm getting an Out of Memory exception. Stack trace at the end. Spark 1.1.1 My JVM has -Xmx2g spark.driver.memory = 1000M spark.executor.memory = 1000M spark.kryoserializer.buffer.mb = 256

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Sean McNamara
We have gone down a similar path at Webtrends; Spark has worked amazingly well for us in this use case. Our solution goes from REST, directly into Spark, and back out to the UI instantly. Here is the resulting product in case you are curious (and please pardon the self promotion):

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Joe Wass
Thanks very much, that's good to know, I'll certainly give it a look. Can you give me a hint about how you unzip your input files on the fly? I thought that it wasn't possible to parallelize zipped inputs unless they were unzipped before passing to Spark? Joe On 3 February 2015 at 17:48, David

GraphX pregel: getting the current iteration number

2015-02-03 Thread Matthew Cornell
Hi Folks, I'm new to GraphX and Scala and my sendMsg function needs to index into an input list to my algorithm based on the pregel()() iteration number, but I don't see a way to access that. I see in

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Jonathan Haddad
Write out the RDD to a Cassandra table. The DataStax driver provides saveToCassandra() for this purpose. On Tue Feb 03 2015 at 8:59:15 AM Adamantios Corais adamantios.cor...@gmail.com wrote: Hi, After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine.
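
A minimal sketch, assuming the spark-cassandra-connector is on the classpath and a matching table already exists (keyspace, table, and column names are placeholders):

    import com.datastax.spark.connector._

    case class Rollup(key: String, metric: String, value: Double)

    // Writes each element as a row; column names must match the table schema.
    val aggregated = sc.parallelize(Seq(Rollup("page1", "views", 42.0)))
    aggregated.saveToCassandra("olap", "rollups", SomeColumns("key", "metric", "value"))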

Re: Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Michael Armbrust
I'll add that I usually just do println(query.queryExecution) On Tue, Feb 3, 2015 at 11:34 AM, Michael Armbrust mich...@databricks.com wrote: You should be able to do something like: sbt -Dscala.repl.maxprintstring=64000 hive/console Here's an overview of catalyst:

Re: Supported Notebooks (and other viz tools) for Spark 0.9.1?

2015-02-03 Thread andy petrella
Adamantios, As said, I backported it to 0.9.x and now it's pushed on this branch: https://github.com/andypetrella/spark-notebook/tree/spark-0.9.x. I haven't created a dist atm, because I'd prefer to do it only if necessary :-). So, if you want to try it out, just clone the repo and check out this

Re: GraphX pregel: getting the current iteration number

2015-02-03 Thread Daniil Osipov
I don't think it's possible to access it. What I've done before is send the current or next iteration index with the message, where the message is a case class. HTH Dan On Tue, Feb 3, 2015 at 10:20 AM, Matthew Cornell corn...@cs.umass.edu wrote: Hi Folks, I'm new to GraphX and Scala and my
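
A hypothetical sketch of that approach, keeping the iteration number in both the message and the vertex attribute so sendMsg can read it off the triplet (all names, types, and the cutoff are made up for illustration):

    import org.apache.spark.graphx._

    case class Msg(iter: Int, value: Double)

    // Vertex attribute: (last-seen iteration, running value).
    def run(graph: Graph[(Int, Double), Double]): Graph[(Int, Double), Double] =
      graph.pregel(Msg(0, 0.0))(
        // vprog: record the iteration the incoming message was sent in
        (id, attr, msg) => (msg.iter, math.max(attr._2, msg.value)),
        // sendMsg: derive the next iteration number from the source vertex
        triplet => {
          val next = triplet.srcAttr._1 + 1
          if (next <= 10) Iterator((triplet.dstId, Msg(next, triplet.srcAttr._2)))
          else Iterator.empty
        },
        // mergeMsg: keep the message from the latest iteration
        (a, b) => if (a.iter >= b.iter) a else b)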

Re: Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Michael Armbrust
You should be able to do something like: sbt -Dscala.repl.maxprintstring=64000 hive/console Here's an overview of catalyst: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit#heading=h.vp2tej73rtm2 On Tue, Feb 3, 2015 at 1:37 AM, Mick Davies

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
Not all of our input files are zipped. The ones that are, obviously, are not parallelized - they're just processed by a single task. Not a big issue for us, though, as those zipped files aren't too big. DR On 02/03/2015 01:08 PM, Joe Wass wrote: Thanks very much, that's good to know,

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
This is an excerpt from the design document of the sort-based shuffle implementation. I am thinking I might be wrong in my perception of sort-based shuffle; I don't completely understand it though. *Motivation* A sort-based shuffle can be more scalable than Spark's current hash-based one because

Re: 2GB limit for partitions?

2015-02-03 Thread Aaron Davidson
To be clear, there is no distinction between partitions and blocks for RDD caching (each RDD partition corresponds to 1 cache block). The distinction is important for shuffling, where by definition N partitions are shuffled into M partitions, creating N*M intermediate blocks. Each of these blocks

Re: 2GB limit for partitions?

2015-02-03 Thread Michael Albert
Thank you! This is very helpful. -Mike From: Aaron Davidson ilike...@gmail.com To: Imran Rashid iras...@cloudera.com Cc: Michael Albert m_albert...@yahoo.com; Sean Owen so...@cloudera.com; user@spark.apache.org user@spark.apache.org Sent: Tuesday, February 3, 2015 6:13 PM Subject: Re:

Re: 2GB limit for partitions?

2015-02-03 Thread Imran Rashid
Thanks for the explanations, makes sense. For the record looks like this was worked on a while back (and maybe the work is even close to a solution?) https://issues.apache.org/jira/browse/SPARK-1476 and perhaps an independent solution was worked on here?

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
I thought that's what sort-based shuffle did: sort the keys going to the same partition. I have tried (c1, c2) as an (Int, Int) tuple as well, so I don't think the ordering of c2's type is the problem here. On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen so...@cloudera.com wrote: Hm, I don't think the sort

Re: 2GB limit for partitions?

2015-02-03 Thread Reynold Xin
cc dev list How are you saving the data? There are two relevant 2GB limits: 1. Caching 2. Shuffle For caching, a partition is turned into a single block. For shuffle, each map partition is partitioned into R blocks, where R = number of reduce tasks. It is unlikely a shuffle block exceeds 2G,

Re: 2GB limit for partitions?

2015-02-03 Thread Imran Rashid
Michael, you are right, there is definitely some limit at 2GB. Here is a trivial example to demonstrate it: import org.apache.spark.storage.StorageLevel val d = sc.parallelize(1 to 1e6.toInt, 1).map { i => new Array[Byte](5e3.toInt) }.persist(StorageLevel.DISK_ONLY) d.count() It gives the same

Setting maxPrintString in Spark Repl to view SQL query plans

2015-02-03 Thread Mick Davies
Hi, I want to increase the maxPrintString of the Spark repl to look at SQL query plans, as they are truncated by default at 800 chars, but don't know how to set this. You don't seem to be able to do it in the same way as you would with the Scala repl. Anyone know how to set this? Also anyone

Re: Spark streaming - tracking/deleting processed files

2015-02-03 Thread Prannoy
Hi, To keep processing the older files as well, you can use fileStream instead of textFileStream. It has a parameter that specifies whether to look for already-present files. For deleting the processed files, one way is to get the list of all files in the dStream. This can be done by using the foreachRDD api of
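
For the first point, a sketch of the relevant parameter (ssc and the directory are placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Unlike textFileStream, newFilesOnly = false also picks up files that
    // are already present in the directory when the stream starts.
    val existingAndNew = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///input/dir", (_: Path) => true, newFilesOnly = false)
      .map(_._2.toString)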

Re: 2GB limit for partitions?

2015-02-03 Thread Michael Albert
Greetings! Thanks for the response. Below is an example of the exception I saw. I'd rather not post code at the moment, so I realize it is completely unreasonable to ask for a diagnosis. However, I will say that adding a partitionBy() was the last change before this error appeared. Thanks for

Re: Spark Shell Timeouts

2015-02-03 Thread amoners
I am not sure whether this will help you. My situation was that I could not see any input in the terminal after some work got done via spark-shell; I ran the command stty echo, and it fixed it. Best, Amoners

LeaseExpiredException while writing schemardd to hdfs

2015-02-03 Thread Hafiz Mujadid
I want to write a whole SchemaRDD to a single file in HDFS but am facing the following exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder DFSClient_NONMAPREDUCE_-564238432_57

Re: Writing RDD to a csv file

2015-02-03 Thread Charles Feduke
In case anyone needs to merge all of their part-n files (small result set only) into a single *.csv file or needs to generically flatten case classes, tuples, etc., into comma separated values: http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/ On Tue Feb 03 2015 at 8:23:59 AM
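
An alternative sketch for small results only: collapse to one partition so Spark itself writes a single part file (resultLines and the output path are placeholders):

    // coalesce(1) funnels everything through a single task, so this is only
    // safe when the whole result fits comfortably in one executor.
    resultLines.coalesce(1).saveAsTextFile("hdfs:///output/single-csv")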

advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-03 Thread Michael Albert
Greetings! First, my sincere thanks to all who have given me advice. Following previous discussion, I've rearranged my code to try to keep the partitions to more manageable sizes. Thanks to all who commented. At the moment, the input set I'm trying to work with is about 90GB (avro parquet

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread Sven Krasser
Hey Joe, With the ephemeral HDFS, you get the instance store of your worker nodes. For m3.xlarge that will be two 40 GB SSDs local to each instance, which are very fast. For the persistent HDFS, you get whatever EBS volumes the launch script configured. EBS volumes are always network drives, so

Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
I am trying to implement secondary sort in Spark as we do in map-reduce. Here is my data (tab separated; the c1, c2, c3 header is shown only for illustration):

c1  c2  c3
1   2   4
1   3   6
2   4   7
2   6   8
3   5   5
3   1   8
3   2   0

To do secondary sort, I
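
The message is cut off above; for reference, a sketch of one standard approach, keying by the composite (c1, c2) but partitioning on c1 alone and sorting within partitions (this uses repartitionAndSortWithinPartitions, available from Spark 1.2; rows is a hypothetical RDD[(Int, Int, Int)]):

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Route records by c1 only, so all rows sharing c1 land in one partition.
    class C1Partitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case (c1: Int, _) => math.abs(c1.hashCode) % numPartitions
      }
    }

    def secondarySort(rows: RDD[(Int, Int, Int)]): RDD[((Int, Int), Int)] =
      rows.map { case (c1, c2, c3) => ((c1, c2), c3) }
        // The implicit Ordering[(Int, Int)] sorts by c1, then c2, within each
        // partition; the partitioner itself only looks at c1.
        .repartitionAndSortWithinPartitions(new C1Partitioner(4))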

Re: Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
Just to add, I am suing Spark 1.1.0

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted is at: OLAP with Cassandra and Spark http://www.slideshare.net/EvanChan2/2014-07olapcassspark. On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote: Write out the rdd to a cassandra table. The

Re: 2GB limit for partitions?

2015-02-03 Thread Mridul Muralidharan
That is fairly out of date (we used to run some of our jobs on it ... But that is forked off 1.1 actually). Regards Mridul On Tuesday, February 3, 2015, Imran Rashid iras...@cloudera.com wrote: Thanks for the explanations, makes sense. For the record looks like this was worked on a while