Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
In general, I don't think that means you should exclude something; it's still needed. The problem is that commons config depends *only* on *beanutils-core 1.8.0*, so it ends up managing that artifact version only, and not the main beanutils one. In this particular instance, which I've seen befo

Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
I found this jira when googling for fixes. Wonder if it can fix anything here. But anyways, thanks for the help :) On Fri, Apr 10, 2015 at 2:46 AM, Sean Owen wrote: > I agree, but as I say, most are out of the control of Spark. They > aren't because

Re: Overlapping classes warnings

2015-04-09 Thread Ted Yu
commons-beanutils is brought in transitively: [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.4.0:compile [INFO] | | +- commons-cli:commons-cli:jar:1.2:compile [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile [INFO] | | +- commons-httpclient:commons-httpclient:jar:3.1:compile [INFO] | | +-

Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
I agree, but as I say, most are out of the control of Spark. They aren't because of unnecessary dependencies. On Thu, Apr 9, 2015 at 5:14 PM, Ritesh Kumar Singh wrote: > Though the warnings can be ignored, they add up in the log files while > compiling other projects too. And there are a lot of t

Re: make two rdd co-partitioned in python

2015-04-09 Thread Davies Liu
In Spark 1.3+, PySpark also support this kind of narrow dependencies, for example, N = 10 a1 = a.partitionBy(N) b1 = b.partitionBy(N) then a1.union(b1) will only have N partitions. So, a1.join(b1) do not need shuffle anymore. On Thu, Apr 9, 2015 at 11:57 AM, pop wrote: > In scala, we can make

Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Though the warnings can be ignored, they add up in the log files while compiling other projects too. And there are a lot of those warnings. Any workaround? How do we modify the pom.xml file to exclude these unnecessary dependencies? On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen wrote: > Generally,

Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
Generally, you can ignore these things. They mean some artifacts packaged other artifacts, and so two copies show up when all the JAR contents are merged. But here you do show a small dependency convergence problem; beanutils 1.7 is present but beanutils-core 1.8 is too, even though these should b

Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Hi, During compilation I get a lot of these: [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define 23 overlapping classes: [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar define 82 overlapping classes: [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3

Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
Finally got it working by increasing the spark version in maven to 1.2.1 On Thu, Apr 9, 2015 at 12:30 PM, Mohit Anchlia wrote: > I changed the JDK to Oracle but I still get this error. Not sure what it > means by "Stream class is incompatible with local class". I am using the > following build o

Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Amit Assudani
Thanks a lot TD for the detailed answers. The answers lead to a few more questions, 1. "the transform RDD-to-RDD function runs on the driver" - I didn't understand this; does it mean that when I use the transform function on a DStream, it is not parallelized? Surely I'm missing something here. 2. update
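For reference, a minimal sketch of the transform pattern being discussed, with a made-up lookup RDD and event stream (paths, host and field layout are placeholders); transform is invoked on the driver once per batch, but the join it sets up still runs in parallel on the executors:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("lookup-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical master data, keyed by entity id, loaded once on the driver.
    val masterRdd = ssc.sparkContext
      .textFile("hdfs:///path/to/master")
      .map { line => val f = line.split(","); (f(0), f(1)) }

    // Hypothetical event stream with the same key.
    val events = ssc.socketTextStream("localhost", 9999)
      .map { line => val f = line.split(","); (f(0), f(1)) }

    // transform runs on the driver once per batch to build the RDD lineage;
    // the resulting join is executed in parallel on the executors.
    val enriched = events.transform { batchRdd => batchRdd.join(masterRdd) }

    enriched.print()
    ssc.start()
    ssc.awaitTermination()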

Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
I changed the JDK to Oracle but I still get this error. Not sure what it means by "Stream class is incompatible with local class". I am using the following build on the server "spark-1.2.1-bin-hadoop2.4" 15/04/09 15:26:24 ERROR JobScheduler: Error running job streaming job 1428607584000 ms.0 org.a

Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Tathagata Das
Responses inline. Hope they help. On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani wrote: > Hi Friends, > > I am trying to solve a use case in spark streaming, I need help on > getting to right approach on lookup / update the master data. > > Use case ( simplified ) > I’ve a dataset of entity w

Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Your point #1 is a bit misleading. >> (1) The mappers are not executed in parallel when processing independently the same RDD. To clarify, I'd say: In one stage of execution, when pipelining occurs, mappers are not executed in parallel when processing independently the same RDD partition. On Thu

make two rdd co-partitioned in python

2015-04-09 Thread pop
In scala, we can make two RDDs use the same partitioner so that they are co-partitioned: val partitioner = new HashPartitioner(5) val a1 = a.partitionBy(partitioner).cache() val b1 = b.partitionBy(partitioner).cache() How can we achieve the same in python? It would be great if somebod
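To make the Scala snippet above self-contained, here is a runnable sketch of the same idea (the input data is made up); because both sides share the partitioner, the join is a narrow dependency and does not shuffle either input again:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("co-partition-sketch"))

    val a = sc.parallelize(Seq((1, "a1"), (2, "a2"), (3, "a3")))
    val b = sc.parallelize(Seq((1, "b1"), (2, "b2"), (4, "b4")))

    // Repartition both RDDs with the same partitioner and cache the results so
    // the partitioning is kept in memory rather than recomputed on each use.
    val partitioner = new HashPartitioner(5)
    val a1 = a.partitionBy(partitioner).cache()
    val b1 = b.partitionBy(partitioner).cache()

    // Co-partitioned inputs: the join needs no extra shuffle.
    a1.join(b1).collect().foreach(println)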

Re: "Could not compute split, block not found" in Spark Streaming Simple Application

2015-04-09 Thread Tathagata Das
Are you running # of receivers = # machines? TD On Thu, Apr 9, 2015 at 9:56 AM, Saiph Kappa wrote: > Sorry, I was getting those errors because my workload was not sustainable. > > However, I noticed that, by just running the spark-streaming-benchmark ( > https://github.com/tdas/spark-streaming-

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread Ankur Dave
Actually, GraphX doesn't need to scan all the edges, because it maintains a clustered index on the source vertex id (that is, it sorts the edges by source vertex id and stores the offsets in a hash table). If the activeDirection is appropriately set, it can then jump only to the clusters with activ
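As an illustration of the active-set behaviour described here, a hedged sketch using the public Pregel API, which is where activeDirection is exposed; the tiny shortest-path graph is made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx._

    val sc = new SparkContext(new SparkConf().setAppName("active-set-sketch"))

    // Made-up graph; the vertex attribute is the current distance from vertex 1.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(1L, 3L, 5.0)))
    val graph = Graph.fromEdges(edges, Double.PositiveInfinity)
      .mapVertices((id, _) => if (id == 1L) 0.0 else Double.PositiveInfinity)

    // With activeDirection = EdgeDirection.Out, sendMsg only runs on edges whose
    // source vertex received a message in the previous superstep, so GraphX can
    // use its clustered index to skip the inactive edge clusters.
    val shortest = Pregel(graph, Double.PositiveInfinity,
        activeDirection = EdgeDirection.Out)(
      vprog = (id, dist, msg) => math.min(dist, msg),
      sendMsg = triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      mergeMsg = (a, b) => math.min(a, b))

    shortest.vertices.collect().foreach(println)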

Re: Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Tathagata Das
Well, you are running in local mode, so it cannot find another peer to replicate the blocks received from receivers. That's it. It's not a real concern, and that error will go away when you run it in a cluster. On Thu, Apr 9, 2015 at 11:24 AM, Nandan Tammineedi wrote: > Hi, > > I'm running a s

Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Nandan Tammineedi
Hi, I'm running a spark streaming job in local mode (--master local[4]), and I'm seeing tons of these messages, roughly once every second - WARN BlockManager: Block input-0-1428527584600 replicated to only 0 peer(s) instead of 1 peers We're using spark 1.2.1. Even with TRACE logging enabled, we'

Re: Caching and Actions

2015-04-09 Thread spark_user_2015
That was helpful! The conclusion: (1) The mappers are not executed in parallel when processing independently the same RDD. (2) The best way seems to be (if enough memory is available and an action is applied to d1 and d2 later on) val d1 = data.map((x,y,z) => (x,y)).cache val d2 = d1
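A fuller sketch of conclusion (2), with made-up input data, showing the shared parent being cached once and reused by two children:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD operations on older Spark versions

    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))

    // Placeholder input; only the lineage structure matters here.
    val data = sc.parallelize(Seq((1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)))

    // Cache the shared parent once...
    val d1 = data.map { case (x, y, z) => (x, y) }.cache()

    // ...so both children read the cached partitions instead of re-mapping the input.
    val d2 = d1.filter { case (x, _) => x % 2 == 0 }
    val d3 = d1.mapValues(_.toUpperCase)

    d2.count()  // first action materialises d1 in the cache
    d3.count()  // second action reuses the cached d1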

Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
Thanks Ted, using HiveTest as my context worked. It still left a metastore directory and Derby log in my current working directory though; I manually added a shutdown hook to delete them and all was well. On Wed, Apr 8, 2015 at 4:33 PM, Ted Yu wrote: > Please take a look at > sql/hive/src/main/s
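A minimal sketch of the cleanup hook described above; the leftover names ("metastore_db", "derby.log") are assumed Derby defaults, not taken from this thread, so adjust them to whatever your test HiveContext actually writes:

    import java.io.File
    import scala.reflect.io.Directory

    // Assumed Derby defaults in the current working directory.
    val leftovers = Seq(new File("metastore_db"), new File("derby.log"))

    sys.addShutdownHook {
      leftovers.foreach { f =>
        if (f.isDirectory) new Directory(f).deleteRecursively() else f.delete()
      }
    }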

Re: Spark Job #of attempts ?

2015-04-09 Thread Deepak Jain
Can I see the current values of all configs, similar to the configuration page in the Hadoop UI? Sent from my iPhone > On 09-Apr-2015, at 11:07 pm, Marcelo Vanzin wrote: > > Set spark.yarn.maxAppAttempts=1 if you don't want retries. > >> On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: >> He

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to Hive 0.12 if you specify it in the profile when building Spark as per https://spark.apache.org/docs/1.3.0/building-spark.html. If you are downloading a pre built version of Spark 1.3 - then by default, it is set to Hive 0.1

Re: Spark Job #of attempts ?

2015-04-09 Thread Marcelo Vanzin
Set spark.yarn.maxAppAttempts=1 if you don't want retries. On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Hello, > I have a spark job with 5 stages. After it runs 3rd stage, the console shows > > > 15/04/09 10:25:57 INFO yarn.Client: Application report for > application_1427705526386_127
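For completeness, a small sketch of setting this programmatically, plus one way to answer the follow-up question about seeing the values that actually took effect (the application name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to passing --conf spark.yarn.maxAppAttempts=1 to spark-submit.
    val conf = new SparkConf()
      .setAppName("my-job")
      .set("spark.yarn.maxAppAttempts", "1")

    val sc = new SparkContext(conf)

    // Prints every explicitly-set property the driver ended up with; the same
    // information is shown on the Environment tab of the Spark web UI.
    sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }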

Spark Job #of attempts ?

2015-04-09 Thread ๏̯͡๏
Hello, I have a spark job with 5 stages. After it runs 3rd stage, the console shows 15/04/09 10:25:57 INFO yarn.Client: Application report for application_1427705526386_127168 (state: RUNNING) 15/04/09 10:25:58 INFO yarn.Client: Application report for application_1427705526386_127168 (state: RUNN

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread ๏̯͡๏
Thanks Sandy, appreciate it On Thu, Apr 9, 2015 at 10:32 PM, Sandy Ryza wrote: > Hi Deepak, > > I'm going to shamelessly plug my blog post on tuning Spark: > > http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ > > It talks about tuning executor size as well as how th

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread Sandy Ryza
Hi Deepak, I'm going to shamelessly plug my blog post on tuning Spark: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ It talks about tuning executor size as well as how the number of tasks for a stage is calculated. -Sandy On Thu, Apr 9, 2015 at 9:21 AM, ÐΞ€ρ@Ҝ

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread ๏̯͡๏
Most likely you have an existing Hive installation with data in it. In this case I was not able to get Spark 1.3 to communicate with the existing Hive metastore. Hence when I read any table created in Hive, Spark SQL used to complain "Data table not found". If you get it working, please share the steps.

Re: Join on Spark too slow.

2015-04-09 Thread ๏̯͡๏
If your data has special characteristics, like one side small and the other large, then you can think of doing a map-side join in Spark using broadcast values; this will speed things up. Otherwise, as Pitel mentioned, if there is nothing special and it's just a cartesian product, it might take forever, or you might incre
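A hedged sketch of the broadcast (map-side) join being suggested, with made-up inputs where 'small' comfortably fits in memory:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

    val small = sc.parallelize(Seq((1, "one"), (2, "two")))
    val large = sc.parallelize(Seq((1, 100.0), (2, 200.0), (3, 300.0)))

    // Collect the small side to the driver once and ship it to each executor.
    val smallMap = sc.broadcast(small.collectAsMap())

    // Map-side join: the large side is never shuffled; keys missing from the
    // small side are simply dropped (i.e. this behaves like an inner join).
    val joined = large.flatMap { case (k, v) =>
      smallMap.value.get(k).map(name => (k, (name, v)))
    }
    joined.collect().foreach(println)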

Re: "Could not compute split, block not found" in Spark Streaming Simple Application

2015-04-09 Thread Saiph Kappa
Sorry, I was getting those errors because my workload was not sustainable. However, I noticed that, by just running the spark-streaming-benchmark ( https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala ), I get no difference on the execution time, number of processed record

Spark Job Run Resource Estimation ?

2015-04-09 Thread ๏̯͡๏
I have a spark job that has multiple stages. For now I start it with 100 executors, each with 12G mem (max is 16G). I am using Spark 1.3 over YARN 2.4.x. For now I start the Spark Job with a very limited input (1 file of size 2G); overall there are 200 files. My first run is yet to complete as its

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Yeah, just ran with 2g for that setting and max.mb with 1068. I am trying to do a map-side join by using a broadcast variable. This first collects all the data (key, value) and then sends it. It's causing an error while running this stage. On Thu, Apr 9, 2015 at 9:29 PM, Ted Yu wrote: > Typo in previo

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Typo in previous email, pardon me. Set "spark.driver.maxResultSize" to 1068 or higher. On Thu, Apr 9, 2015 at 8:57 AM, Ted Yu wrote: > Please set "spark.kryoserializer.buffer.max.mb" to 1068 (or higher). > > Cheers > > On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > >> Pressed send earl

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Please set "spark.kryoserializer.buffer.max.mb" to 1068 (or higher). Cheers On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Pressed send early. > > I had tried that with these settings > > buffersize=128 maxbuffersize=1024 > > val conf = new SparkConf() > > .setAppName(detail) >

Which Hive version should be used for Spark 1.3

2015-04-09 Thread Arthur Chan
Hi, I use Hive 0.12 for Spark 1.2 at the moment and plan to upgrade to Spark 1.3.x. Could anyone advise which Hive version should be used to match Spark 1.3.x? Can I use Hive 1.1.0 for Spark 1.3? Or can I use Hive 0.14 for Spark 1.3? Regards Arthur

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Pressed send early. I had tried that with these settings buffersize=128 maxbuffersize=1024 val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb",arguments.get("buffersize").g

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Yes i had tried that. Now i see this 15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at VISummaryDataProvider.scala:38, took 275.334991 s 15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Total size of serialized res

Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive? If you're getting the same error within Hive, it sounds like a permissions issue as per Bojan. More info can be found at: http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error On Thu, Apr 9, 201

Lookup / Access of master data in spark streaming

2015-04-09 Thread Amit Assudani
Hi Friends, I am trying to solve a use case in spark streaming, and I need help on getting to the right approach on looking up / updating the master data. Use case ( simplified ) I've a dataset of entities with three attributes and an identifier/row key in a persistent store. Each attribute along with row key co

Re: Pairwise computations within partition

2015-04-09 Thread Guillaume Pitel
I would try something like this: val a = rdd.sample(false,0.1,1).zipWithIndex.map{ case (vector,index) => (index,vector)} val b = rdd.sample(false,0.1,2).zipWithIndex.map{ case (vector,index) => (index,vector)} a.join(b).map { case (_,(vectora,vectorb)) => yourOperation } Grouping by blocks
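A self-contained version of the sketch above; the element type and the operation applied to each pair are placeholders:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Pairs the i-th element of one 10% sample with the i-th element of another.
    def samplePairs[T: ClassTag](rdd: RDD[T], fraction: Double = 0.1): RDD[(T, T)] = {
      val a = rdd.sample(false, fraction, seed = 1).zipWithIndex().map { case (v, i) => (i, v) }
      val b = rdd.sample(false, fraction, seed = 2).zipWithIndex().map { case (v, i) => (i, v) }
      a.join(b).map { case (_, (va, vb)) => (va, vb) }  // apply your pairwise operation here
    }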

Re: Join on Spark too slow.

2015-04-09 Thread Guillaume Pitel
Maybe I'm wrong, but what you are doing here is basically a bunch of cartesian products, one for each key. So if "hello" appears 100 times in your corpus, it will produce 100*100 elements in the join output. I don't understand what you're doing here, but it's normal your join takes forever, it makes

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread Guillaume Pitel
Hi, From my experience, those errors happen under very high memory pressure, and/or with machines with bad hardware (memory, network card,..) I have had a few of them, as well as Snappy uncompress errors, on a machine with a slightly failing memory stick. Given the large amount of data trans

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Please take a look at https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236 , starting line 27. In Spark, you can control the maxBufferSize with "spark.kryoserializer.buffer.max.mb" Cheers

RDD union

2015-04-09 Thread Debasish Das
Hi, I have some code that creates ~80 RDDs and then applies sc.union to combine all 80 into one for the next step (to run topByKey, for example)... While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3 hrs (I am validating these numbers)... Is there any checkpoint based
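One thing worth checking, sketched here under the assumption that the 80 RDDs are currently combined pairwise: passing the whole sequence to SparkContext.union builds a single UnionRDD instead of an 80-deep chain of nested unions, which keeps the lineage flat and cheap to analyse:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("union-sketch"))

    // Stand-ins for the ~80 per-source RDDs.
    val parts: Seq[RDD[(String, Double)]] =
      (1 to 80).map(i => sc.parallelize(Seq((s"key$i", i.toDouble))))

    // One UnionRDD over all inputs rather than parts.reduce(_ union _).
    val all = sc.union(parts)
    all.count()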

Any success on embedding local Spark in OSGi?

2015-04-09 Thread Deniz Acay
Hi, I have been trying to use Spark in an OSGi bundle but I have had no luck so far. I have seen similar mails in the past, so I am wondering, has anyone successfully run Spark inside an OSGi bundle? I am running Spark in the bundle created with the Maven shade plugin and have even tried adding Akka JARs in

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread Ted Yu
Is there custom class involved in your application ? I assume you have called sparkConf.registerKryoClasses() for such class(es). Cheers On Thu, Apr 9, 2015 at 7:15 AM, mehdisinger wrote: > Hi, > > I'm facing an issue when I try to run my Spark application. I keep getting > the following excep
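For reference, a sketch of the registration call being referred to; the case classes are hypothetical stand-ins for whatever custom types flow through the job:

    import org.apache.spark.SparkConf

    case class Order(id: Long, amount: Double)
    case class Customer(id: Long, name: String)

    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration lets Kryo write small class IDs instead of full names and
      // keeps readers and writers agreeing on what those IDs mean.
      .registerKryoClasses(Array(classOf[Order], classOf[Customer]))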

spark job progress-style report on console ?

2015-04-09 Thread roy
Hi, How do I get a spark job progress-style report on the console? I tried to set --conf spark.ui.showConsoleProgress=true but it thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-job-progress-style-report-on-console-tp22440.html Sent from the Ap

Re: SQL can't not create Hive database

2015-04-09 Thread Bojan Kostic
I think it uses a local dir; an HDFS dir path starts with hdfs://. Check permissions on the folders, and also check the logs. There should be more info about the exception. Best Bojan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-can-t-not-create-Hive-database-tp22435

Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
My Spark (1.3.0) job is failing with com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1+details com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at c

How to submit job in a different user?

2015-04-09 Thread SecondDatke
Well, maybe a Linux configuration problem... I have a cluster that is about to be exposed to the public, and I want everyone that uses my cluster to own a user (without permissions of sudo, etc.) (e.g. 'guest'), and to be able to submit tasks to Spark, which is working on Mesos that is running with a different, pri

Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread mehdisinger
Hi, I'm facing an issue when I try to run my Spark application. I keep getting the following exception: 15/04/09 15:14:07 ERROR Executor: Exception in task 5.0 in stage 1.0 (TID 5) com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994 Serialization trace: ord (org.apac

Pairwise computations within partition

2015-04-09 Thread abellet
Hello everyone, I am a Spark novice facing a nontrivial problem to solve with Spark. I have an RDD consisting of many elements (say, 60K), where each element is a d-dimensional vector. I want to implement an iterative algorithm which does the following. At each iteration, I want to apply an o

RE: SQL can't not create Hive database

2015-04-09 Thread java8964
Can you try the URI of local file format, something like this: hiveContext.hql("SET hive.metastore.warehouse.dir=file:///home/spark/hive/warehouse") Yong > Date: Thu, 9 Apr 2015 04:59:00 -0700 > From: inv...@gmail.com > To: user@spark.apache.org > Subject: SQL can't not create Hive database > > H
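A slightly fuller sketch of the suggestion, assuming an existing SparkContext named sc and the same placeholder path:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("hive.metastore.warehouse.dir",
      "file:///home/spark/hive/warehouse")
    hiveContext.sql("create database if not exists db1")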

Join on Spark too slow.

2015-04-09 Thread Kostas Kloudas
Hello guys, I am trying to run the following dummy example for Spark, on a dataset of 250MB, using 5 machines with >10GB RAM each, but the join seems to be taking too long (> 2hrs). I am using Spark 0.8.0 but I have also tried the same example on more recent versions, with the same results. Do y

save as text file throwing null pointer error.

2015-04-09 Thread Somnath Pandeya
JavaRDD lineswithoutStopWords = nonEmptylines.map(new Function() { private static final long serialVersionUID = 1L;

Re: override log4j.properties

2015-04-09 Thread Emre Sevinc
One method: By putting your custom log4j.properties file in your /resources directory. As an example, please see: http://stackoverflow.com/a/2736/236007 Kind regards, Emre Sevinç http://www.bigindustries.be/ On Thu, Apr 9, 2015 at 2:17 PM, patcharee wrote: > Hello, > > How to override l

override log4j.properties

2015-04-09 Thread patcharee
Hello, How to override log4j.properties for a specific spark job? BR, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-09 Thread Todd Nist
Hi Mohammed, Sorry, I guess I was not really clear in my response. Yes, sbt fails; the -DskipTests is for mvn, as I showed in the example of how I built it. I do not believe that -DskipTests has any impact in sbt, but I could be wrong. sbt package should skip tests. I did not try to track down

SQL can't not create Hive database

2015-04-09 Thread Hao Ren
Hi, I am working in local mode. The following code hiveContext.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse") hiveContext.sql("create database if not exists db1") throws 15/04/09 13:53:16 ERROR RetryingHMSHandler: MetaException(message:Unable to create database path

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Ok, what do i need to do in order to migrate the patch? Thanks Alex On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma wrote: > This is the jira I referred to > https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not > working on it is evaluating priority between upgrading to sca

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
This is the jira I referred to: https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not working on it was evaluating the priority between upgrading to scala 2.11.5 (it is non-trivial, I suppose, because the repl has changed a bit) and migrating that patch, which is much simpler. Prashant Sharma On

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Hi- Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988 Any help in getting this working would be much appreciated! Thanks Alex On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma wrote: > You are right this needs to be done. I can work on it soon, I was not sure > if there

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
You are right this needs to be done. I can work on it soon, I was not sure if there is any one even using scala 2.11 spark repl. Actually there is a patch in scala 2.10 shell to support adding jars (Lost the JIRA ID), which has to be ported for scala 2.11 too. If however, you(or anyone else) are pl

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread James
In aggregateMessagesWithActiveSet, Spark still has to read all edges. It means that a fixed cost which scales with graph size is unavoidable in a Pregel-like iteration. But what if I have to run nearly 100 iterations, and in the last 50 iterations there are only < 0.1% of nodes that need to be updated

Spark Streaming scenarios

2015-04-09 Thread Vinay Kesarwani
Hi, I have the following scenario and need some help ASAP. 1. Ad hoc queries on spark streaming: how can I run spark queries on an ongoing streaming context? Scenario: a streaming job is running to find the min and max value in the last 5 min (which I am able to do). Now I want to run an interactive query to f

External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread anakos
Hi- I am having difficulty getting the 1.3.0 Spark shell to find an external jar. I have built Spark locally for Scala 2.11 and I am starting the REPL as follows: bin/spark-shell --master yarn --jars data-api-es-data-export-4.0.0.jar I see the following line in the console output: 15/04/09 09:

Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Hi there, You should be selective about which RDDs you cache and which you don't. A good candidate RDD for caching is one that you reuse multiple times. Commonly the reuse is for iterative machine learning algorithms that need to take multiple passes over the data. If you try to cache a really la

Re: Caching and Actions

2015-04-09 Thread Bojan Kostic
You can use toDebugString to see all the steps in job. Best Bojan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-and-Actions-tp22418p22433.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --
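For example (the input path is a placeholder):

    // 'sc' is an existing SparkContext.
    val counts = sc.textFile("hdfs:///tmp/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the chain of RDDs (with shuffle boundaries indented) that an
    // action on 'counts' would execute.
    println(counts.toDebugString)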

Re: Add row IDs column to data frame

2015-04-09 Thread Bojan Kostic
Hi, I just checked and I can see that there is a method called withColumn: def withColumn(colName: String, col: Column): DataFrame
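withColumn needs a Column expression derived from the DataFrame, so for adding a fresh row-id column in Spark 1.3 one common workaround goes through the underlying RDD instead. A hedged sketch, with made-up column names and assuming an existing SparkContext named sc:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up example DataFrame.
    val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("letter", "num")

    // zipWithIndex assigns each row a Long index; fold it back into the schema.
    val withId = sqlContext.createDataFrame(
      df.rdd.zipWithIndex().map { case (row, id) => Row.fromSeq(row.toSeq :+ id) },
      StructType(df.schema.fields :+ StructField("row_id", LongType, nullable = false)))

    withId.show()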