Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
In general, I don't think that means you should exclude something; it's still needed. The problem is that commons-config depends *only* on *beanutils-core 1.8.0*, so it ends up managing that artifact version only, and not the main beanutils one. In this particular instance, which I've seen

Pairwise computations within partition

2015-04-09 Thread abellet
Hello everyone, I am a Spark novice facing a nontrivial problem to solve with Spark. I have an RDD consisting of many elements (say, 60K), where each element is a d-dimensional vector. I want to implement an iterative algorithm which does the following. At each iteration, I want to apply an

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1, though you can bind it to Hive 0.12 if you specify it in the profile when building Spark, as per https://spark.apache.org/docs/1.3.0/building-spark.html. If you are downloading a pre-built version of Spark 1.3, then by default it is set to Hive

RDD union

2015-04-09 Thread Debasish Das
Hi, I have some code that creates ~80 RDDs, and then sc.union is applied to combine all 80 into one for the next step (to run topByKey, for example)... While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3 hrs (I am validating these numbers)... Is there any checkpoint
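
For reference, a minimal sketch of an n-way union with a lineage checkpoint (RDD contents and paths are hypothetical); SparkContext.union combines all the RDDs in one step and keeps the lineage much shallower than chaining pairwise unions:

    import org.apache.spark.rdd.RDD

    // Build the ~80 RDDs up front (placeholder construction).
    val rdds: Seq[RDD[(String, Double)]] =
      (1 to 80).map(i => sc.parallelize(Seq((s"key$i", i.toDouble))))

    // One n-way union instead of rdds.reduce(_ union _), which nests 80 levels deep.
    val combined = sc.union(rdds)

    // Optionally truncate the lineage before heavy downstream stages.
    sc.setCheckpointDir("/tmp/checkpoints") // hypothetical path
    combined.checkpoint()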

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Ok, what do I need to do in order to migrate the patch? Thanks Alex On Thu, Apr 9, 2015 at 11:54 AM, Prashant Sharma scrapco...@gmail.com wrote: This is the JIRA I referred to: https://issues.apache.org/jira/browse/SPARK-3256. Another reason for not working on it is evaluating priority

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Saiph Kappa
Sorry, I was getting those errors because my workload was not sustainable. However, I noticed that, by just running the spark-streaming-benchmark ( https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala ), I get no difference in the execution time, number of processed

Re: Kryo exception : Encountered unregistered class ID: 13994

2015-04-09 Thread Ted Yu
Is there a custom class involved in your application? I assume you have called sparkConf.registerKryoClasses() for such class(es). Cheers On Thu, Apr 9, 2015 at 7:15 AM, mehdisinger mehdi.sin...@lampiris.be wrote: Hi, I'm facing an issue when I try to run my Spark application. I keep getting
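
A minimal sketch of the registration Ted refers to; MyCustomClass is a hypothetical stand-in for whatever classes the application ships:

    import org.apache.spark.SparkConf

    case class MyCustomClass(id: Int, payload: String) // hypothetical application class

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Register every custom class Kryo will serialize, so class IDs match across JVMs.
    conf.registerKryoClasses(Array(classOf[MyCustomClass]))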

Re: Incrementally load big RDD file into Memory

2015-04-09 Thread MUHAMMAD AAMIR
Hi, Thanks a lot for such a detailed response. On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi Muhammad, There are lots of ways to do it. My company actually develops a text mining solution which embeds a very fast Approximate Neighbours solution (a

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Typo in previous email, pardon me. Set spark.driver.maxResultSize to 1068 or higher. On Thu, Apr 9, 2015 at 8:57 AM, Ted Yu yuzhih...@gmail.com wrote: Please set spark.kryoserializer.buffer.max.mb to 1068 (or higher). Cheers On Thu, Apr 9, 2015 at 8:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
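
A combined sketch of the two settings from this thread in SparkConf form (the buffer value is the one quoted above; the maxResultSize value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.kryoserializer.buffer.max.mb", "1068") // max Kryo buffer in MB (Spark 1.3-era key)
      .set("spark.driver.maxResultSize", "2g")           // cap on serialized results collected to the driver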

How to submit a job as a different user?

2015-04-09 Thread SecondDatke
Well, maybe a Linux configuration problem... I have a cluster that is about to be exposed to the public, and I want everyone that uses my cluster to own a user account (without sudo permissions, etc.) (e.g. 'guest') and to be able to submit tasks to Spark, which is working on Mesos that is running with a different,

Re: Spark Job #of attempts ?

2015-04-09 Thread Deepak Jain
Can I see the current values of all configs, similar to the configuration UI in the Hadoop world? Sent from my iPhone On 09-Apr-2015, at 11:07 pm, Marcelo Vanzin van...@cloudera.com wrote: Set spark.yarn.maxAppAttempts=1 if you don't want retries. On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)

SQL can't create Hive database

2015-04-09 Thread Hao Ren
Hi, I am working in local mode. The following code:

    hiveContext.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse")
    hiveContext.sql("create database if not exists db1")

throws: 15/04/09 13:53:16 ERROR RetryingHMSHandler: MetaException(message:Unable to create database path

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread ๏̯͡๏
Thanks Sandy, appreciate it. On Thu, Apr 9, 2015 at 10:32 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Deepak, I'm going to shamelessly plug my blog post on tuning Spark: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ It talks about tuning executor size

Re: Spark Job #of attempts ?

2015-04-09 Thread Marcelo Vanzin
Set spark.yarn.maxAppAttempts=1 if you don't want retries. On Thu, Apr 9, 2015 at 10:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello, I have a spark job with 5 stages. After it runs 3rd stage, the console shows 15/04/09 10:25:57 INFO yarn.Client: Application report for
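
The same setting can also be applied in code rather than on the command line; a minimal sketch:

    import org.apache.spark.SparkConf

    // Disable YARN application retries so a failed job is not automatically resubmitted.
    val conf = new SparkConf().set("spark.yarn.maxAppAttempts", "1")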

Re: Join on Spark too slow.

2015-04-09 Thread Guillaume Pitel
Maybe I'm wrong, but what you are doing here is basically a bunch of cartesian products, one for each key. So if hello appears 100 times in your corpus, it will produce 100*100 elements in the join output. I don't understand what you're doing here, but it's normal that your join takes forever; it makes

Re: Pairwise computations within partition

2015-04-09 Thread Guillaume Pitel
I would try something like that:

    val a = rdd.sample(false, 0.1, 1).zipWithIndex.map { case (vector, index) => (index, vector) }
    val b = rdd.sample(false, 0.1, 2).zipWithIndex.map { case (vector, index) => (index, vector) }
    a.join(b).map { case (_, (vectora, vectorb)) => yourOperation }

Grouping by blocks is
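
A fuller sketch of the same approach, assuming the elements are Array[Double] and using squared Euclidean distance as a stand-in for yourOperation:

    import org.apache.spark.rdd.RDD

    def samplePairs(rdd: RDD[Array[Double]]): RDD[Double] = {
      // Two independent 10% samples, keyed by position so they can be joined.
      val a = rdd.sample(false, 0.1, seed = 1).zipWithIndex.map { case (v, i) => (i, v) }
      val b = rdd.sample(false, 0.1, seed = 2).zipWithIndex.map { case (v, i) => (i, v) }
      // Pair the samples index-by-index and apply the pairwise operation.
      a.join(b).map { case (_, (va, vb)) =>
        va.zip(vb).map { case (x, y) => (x - y) * (x - y) }.sum
      }
    }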

Any success on embedding local Spark in OSGi?

2015-04-09 Thread Deniz Acay
Hi, I have been trying to use Spark in an OSGi bundle, but have had no luck so far. I have seen similar mails in the past, so I am wondering: has anyone successfully run Spark inside an OSGi bundle? I am running Spark in a bundle created with the Maven shade plugin, and even tried adding Akka JARs

Re: Overlapping classes warnings

2015-04-09 Thread Ted Yu
commons-beanutils is brought in transitively:

    [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.4.0:compile
    [INFO] | | +- commons-cli:commons-cli:jar:1.2:compile
    [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile
    [INFO] | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
    [INFO] | | +-

override log4j.properties

2015-04-09 Thread patcharee
Hello, How do I override log4j.properties for a specific Spark job? BR, Patcharee

Join on Spark too slow.

2015-04-09 Thread Kostas Kloudas
Hello guys, I am trying to run the following dummy example for Spark, on a dataset of 250MB, using 5 machines with 10GB RAM each, but the join seems to be taking too long (> 2 hrs). I am using Spark 0.8.0 but I have also tried the same example on more recent versions, with the same results. Do

Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Nandan Tammineedi
Hi, I'm running a spark streaming job in local mode (--master local[4]), and I'm seeing tons of these messages, roughly once every second: WARN BlockManager: Block input-0-1428527584600 replicated to only 0 peer(s) instead of 1 peers. We're using Spark 1.2.1. Even with TRACE logging enabled,

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Yes, I had tried that. Now I see this: 15/04/09 07:58:08 INFO scheduler.DAGScheduler: Job 0 failed: collect at VISummaryDataProvider.scala:38, took 275.334991 s 15/04/09 07:58:08 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Total size of serialized

Lookup / Access of master data in spark streaming

2015-04-09 Thread Amit Assudani
Hi Friends, I am trying to solve a use case in spark streaming, and I need help on getting to the right approach on lookup / update of the master data. Use case (simplified): I've a dataset of entities with three attributes and an identifier/row key in a persistent store. Each attribute along with row key

Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
Generally, you can ignore these things. They mean some artifacts packaged other artifacts, and so two copies show up when all the JAR contents are merged. But here you do show a small dependency convergence problem; beanutils 1.7 is present but beanutils-core 1.8 is too, even though these should

Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
I changed the JDK to Oracle but I still get this error. Not sure what it means by "Stream class is incompatible with local class". I am using the following build on the server: spark-1.2.1-bin-hadoop2.4. 15/04/09 15:26:24 ERROR JobScheduler: Error running job streaming job 1428607584000 ms.0

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread James
In aggregateMessagesWithActiveSet, Spark still has to read all edges. It means that a fixed cost that scales with graph size is unavoidable on a Pregel-like iteration. But what if I have to run nearly 100 iterations, and at the last 50 iterations only 0.1% of the nodes need to be updated

Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Though the warnings can be ignored, they add up in the log files while compiling other projects too. And there are a lot of those warnings. Any workaround? How do we modify the pom.xml file to exclude these unnecessary dependencies? On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen so...@cloudera.com

Re: make two rdd co-partitioned in python

2015-04-09 Thread Davies Liu
In Spark 1.3+, PySpark also supports this kind of narrow dependency. For example:

    N = 10
    a1 = a.partitionBy(N)
    b1 = b.partitionBy(N)

then a1.union(b1) will only have N partitions, so a1.join(b1) does not need a shuffle anymore. On Thu, Apr 9, 2015 at 11:57 AM, pop xia...@adobe.com wrote: In

Re: override log4j.properties

2015-04-09 Thread Emre Sevinc
One method: By putting your custom log4j.properties file in your /resources directory. As an example, please see: http://stackoverflow.com/a/2736/236007 Kind regards, Emre Sevinç http://www.bigindustries.be/ On Thu, Apr 9, 2015 at 2:17 PM, patcharee patcharee.thong...@uni.no wrote:
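
For the resources-directory approach, a minimal log4j.properties an application jar could carry (contents are illustrative, modeled on Spark's default template):

    # src/main/resources/log4j.properties
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n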

Re: Add row IDs column to data frame

2015-04-09 Thread Bojan Kostic
Hi, I just checked and I can see that there is a method called withColumn:

    def withColumn(colName: String, col: Column): DataFrame

(see http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html and http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html )
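
withColumn adds a derived Column. For the original question (attaching a positional row ID) under the Spark 1.3 API, one hedged approach goes through the underlying RDD; df and sqlContext are assumed to exist:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Append a Long index to every row, then rebuild the DataFrame with an extended schema.
    val rowsWithId = df.rdd.zipWithIndex.map { case (row, id) => Row.fromSeq(row.toSeq :+ id) }
    val schemaWithId = StructType(df.schema.fields :+ StructField("row_id", LongType, nullable = false))
    val dfWithId = sqlContext.createDataFrame(rowsWithId, schemaWithId)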

spark job progress-style report on console ?

2015-04-09 Thread roy
Hi, How do I get a spark job progress-style report on the console? I tried to set --conf spark.ui.showConsoleProgress=true but it thanks

Re: SQL can't create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive? If you're getting the same error within Hive, it sounds like a permissions issue as per Bojan. More info can be found at: http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error On Thu, Apr 9,

Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Tathagata Das
Responses inline. Hope they help. On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani aassud...@impetus.com wrote: Hi Friends, I am trying to solve a use case in spark streaming, I need help on getting to right approach on lookup / update the master data. Use case ( simplified ) I’ve a

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-09 Thread Todd Nist
Hi Mohammed, Sorry, I guess I was not really clear in my response. Yes, sbt fails; the -DskipTests is for mvn, as I showed in the example of how I built it. I do not believe that -DskipTests has any impact in sbt, but I could be wrong. sbt package should skip tests. I did not try to track

Re: Spark Job Run Resource Estimation ?

2015-04-09 Thread Sandy Ryza
Hi Deepak, I'm going to shamelessly plug my blog post on tuning Spark: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ It talks about tuning executor size as well as how the number of tasks for a stage is calculated. -Sandy On Thu, Apr 9, 2015 at 9:21 AM,

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
Pressed send early. I had tried that with these settings: buffersize=128, maxbuffersize=1024

    val conf = new SparkConf()
      .setAppName(detail)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)

Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread ๏̯͡๏
My Spark (1.3.0) job is failing with com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1

    com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
        at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
        at

Re: SQL can't create Hive database

2015-04-09 Thread Bojan Kostic
I think it uses a local dir; an HDFS dir path starts with hdfs://. Check permissions on the folders, and also check the logs. There should be more info about the exception. Best Bojan

save as text file throwing null pointer error.

2015-04-09 Thread Somnath Pandeya
JavaRDD<String> lineswithoutStopWords = nonEmptylines
        .map(new Function<String, String>() {
            /**
             *
             */
            private static final long

Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Your point #1 is a bit misleading: "(1) The mappers are not executed in parallel when processing independently the same RDD." To clarify, I'd say: in one stage of execution, when pipelining occurs, mappers are not executed in parallel when processing independently the same RDD partition. On Thu,

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Alex Nakos
Hi- Was this the JIRA issue? https://issues.apache.org/jira/browse/SPARK-2988 Any help in getting this working would be much appreciated! Thanks Alex On Thu, Apr 9, 2015 at 11:32 AM, Prashant Sharma scrapco...@gmail.com wrote: You are right this needs to be done. I can work on it soon, I was

Spark Job Run Resource Estimation ?

2015-04-09 Thread ๏̯͡๏
I have a spark job that has multiple stages. For now I start it with 100 executors, each with 12G mem (max is 16G). I am using Spark 1.3 over YARN 2.4.x. For now I start the Spark job with a very limited input (1 file of size 2G); overall there are 200 files. My first run is yet to complete as its

Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Amit Assudani
Thanks a lot TD for the detailed answers. The answers lead to a few more questions: 1. "the transform RDD-to-RDD function runs on the driver" - I didn't understand this; does it mean that when I use the transform function on a DStream, it is not parallelized? Surely I'm missing something here. 2.
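
On question 1, a sketch may make the point clearer: the closure passed to transform runs on the driver once per batch to set up the computation, but the RDD operations it creates still execute on the executors. Here pairDStream is assumed to hold key-value pairs, and masterRdd is a hypothetical lookup RDD:

    // Runs on the driver at each batch interval: chooses/builds the RDD plan.
    val enriched = pairDStream.transform { batchRdd =>
      // The join itself is a distributed operation executed on the cluster.
      batchRdd.join(masterRdd)
    }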

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread ๏̯͡๏
Most likely you have an existing Hive installation with data in it. In this case I was not able to get Spark 1.3 to communicate with the existing Hive metastore. Hence, when I read any table created in Hive, Spark SQL used to complain "Data table not found". If you get it working, please share the steps.

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Tathagata Das
Are you running # of receivers = # machines? TD On Thu, Apr 9, 2015 at 9:56 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Sorry, I was getting those errors because my workload was not sustainable. However, I noticed that, by just running the spark-streaming-benchmark (

Re: Overlapping classes warnings

2015-04-09 Thread Sean Owen
I agree, but as I say, most are out of the control of Spark. They aren't because of unnecessary dependencies. On Thu, Apr 9, 2015 at 5:14 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Though the warnings can be ignored, they add up in the log files while compiling other projects

Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
I found this jira https://jira.codehaus.org/browse/MSHADE-128 when googling for fixes. Wonder if it can fix anything here. But anyways, thanks for the help :) On Fri, Apr 10, 2015 at 2:46 AM, Sean Owen so...@cloudera.com wrote: I agree, but as I say, most are out of the control of Spark. They

Re: Jobs failing with KryoException (BufferOverflow)

2015-04-09 Thread Ted Yu
Please take a look at https://code.google.com/p/kryo/source/browse/trunk/src/com/esotericsoftware/kryo/io/Output.java?r=236 , starting line 27. In Spark, you can control the maxBufferSize with spark.kryoserializer.buffer.max.mb Cheers

Spark Streaming scenarios

2015-04-09 Thread Vinay Kesarwani
Hi, I have the following scenarios and need some help ASAP. 1. Ad hoc queries on Spark Streaming: how can I run Spark queries on an ongoing streaming context? Scenario: a streaming job is running to find the min and max value in the last 5 min (which I am able to do). Now I want to run an interactive query to

Re: Join on Spark too slow.

2015-04-09 Thread ๏̯͡๏
If your data has special characteristics, like one side small and the other large, then you can think of doing a map-side join in Spark using broadcast values; this will speed things up. Otherwise, as Pitel mentioned, if there is nothing special and it's just a cartesian product, it might take forever, or you might
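
A hedged sketch of the map-side (broadcast) join suggested here, assuming the small side fits in driver memory; all names are hypothetical:

    // Collect the small pair RDD to the driver and broadcast it to every executor.
    val smallMap = smallRdd.collectAsMap()
    val bcast = sc.broadcast(smallMap)

    // Each partition of the large RDD joins locally against the broadcast map: no shuffle.
    val joined = largeRdd.flatMap { case (k, v) =>
      bcast.value.get(k).map(w => (k, (v, w)))
    }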

Spark Job #of attempts ?

2015-04-09 Thread ๏̯͡๏
Hello, I have a spark job with 5 stages. After it runs 3rd stage, the console shows 15/04/09 10:25:57 INFO yarn.Client: Application report for application_1427705526386_127168 (state: RUNNING) 15/04/09 10:25:58 INFO yarn.Client: Application report for application_1427705526386_127168 (state:

Re: Caching and Actions

2015-04-09 Thread Bojan Kostic
You can use toDebugString to see all the steps in the job. Best Bojan
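
For example (the pipeline here is hypothetical):

    val rdd = sc.textFile("data.txt").map(_.length).cache()
    // Prints the RDD lineage, one line per dependency, including persistence info.
    println(rdd.toDebugString)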

Re: Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Tathagata Das
Well, you are running in local mode, so it cannot find another peer to replicate the blocks received from receivers. That's it. It's not a real concern, and that error will go away when you run it in a cluster. On Thu, Apr 9, 2015 at 11:24 AM, Nandan Tammineedi nan...@defend7.com wrote: Hi,

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread Ankur Dave
Actually, GraphX doesn't need to scan all the edges, because it maintains a clustered index on the source vertex id (that is, it sorts the edges by source vertex id and stores the offsets in a hash table). If the activeDirection is appropriately set, it can then jump only to the clusters with

make two rdd co-partitioned in python

2015-04-09 Thread pop
In Scala, we can make two RDDs use the same partitioner so that they are co-partitioned:

    val partitioner = new HashPartitioner(5)
    val a1 = a.partitionBy(partitioner).cache()
    val b1 = b.partitionBy(partitioner).cache()

How can we achieve the same in Python? It would be great if

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-09 Thread Prashant Sharma
You are right, this needs to be done. I can work on it soon; I was not sure if there is anyone even using the Scala 2.11 Spark REPL. Actually there is a patch in the Scala 2.10 shell to support adding jars (lost the JIRA ID), which has to be ported for Scala 2.11 too. If however you (or anyone else) are

Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
Thanks Ted, using HiveTest as my context worked. It still left a metastore directory and Derby log in my current working directory though; I manually added a shutdown hook to delete them and all was well. On Wed, Apr 8, 2015 at 4:33 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at
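
A sketch of the cleanup hook described above; metastore_db and derby.log are the usual Derby leftovers in the working directory, but treat the paths as assumptions:

    import java.io.File

    def deleteRecursively(f: File): Unit = {
      Option(f.listFiles).foreach(_.foreach(deleteRecursively))
      f.delete()
    }

    // Remove Derby's leftovers once the test JVM exits.
    sys.addShutdownHook {
      deleteRecursively(new File("metastore_db"))
      new File("derby.log").delete()
    }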