Re: spark-submit on YARN is slow

2014-12-06 Thread Sandy Ryza
Great to hear! -Sandy On Fri, Dec 5, 2014 at 11:17 PM, Denny Lee denny.g@gmail.com wrote: Okay, my bad for not testing out the documented arguments - once I use the correct ones, the query completes in ~55s (I can probably make it faster). Thanks for the help, eh?! On Fri

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
With that said, and given the nature of the iterative algorithms that Spark is advertised for, isn't this a bit of an unnecessary restriction, since I don't see where the problem is? For instance, it is clear that when aggregating you need operations to be associative because of the way they are divided
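A minimal sketch of why associativity matters here (illustrative, not from the thread; `sc` is assumed to be an existing SparkContext): Spark reduces each partition locally and then merges the partial results, so the operation must give the same answer however the operands are grouped.

    // Addition is associative, so per-partition partial sums can be merged
    // in any grouping and still yield 5050.
    val sum = sc.parallelize(1 to 100, 4).reduce(_ + _)
    // A non-associative op (e.g. subtraction) would give partition-dependent results.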

Re: Java RDD Union

2014-12-06 Thread Sean Owen
I guess a major problem with this is that you lose fault tolerance. You have no way of recreating the local state of the mutable RDD if a partition is lost. Why would you need thousands of RDDs for kmeans? It's a few per iteration. An RDD is more bookkeeping than data structure, itself. They
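A minimal sketch of the pattern Sean describes, on made-up 1-D data (assumes an existing SparkContext `sc`; in this toy version a centroid that attracts no points simply drops out): each iteration derives new RDDs from the input rather than mutating anything, so a lost partition can always be recomputed from lineage.

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    val points    = sc.parallelize(Seq(1.0, 1.2, 5.0, 5.3, 9.1), 2)
    var centroids = Array(0.0, 10.0)
    for (_ <- 1 to 10) {
      val current  = centroids                 // snapshot captured by the closure
      val assigned = points.map(p => (current.minBy(c => math.abs(c - p)), p))
      centroids = assigned.groupByKey()
                          .mapValues(vs => vs.sum / vs.size)
                          .values
                          .collect()
    }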

Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
This is from a separate thread with a differently named title. Why can't you modify the actual contents of an RDD using forEach? It appears to be working for me. What I'm doing is changing cluster assignments and distances per data item for each iteration of the clustering algorithm. The
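For contrast, a sketch of the functional alternative that the replies below argue for (Doc, docs and nearestCluster are hypothetical names, not Ron's actual code): deriving a new RDD with map keeps the updated assignments in the lineage, whereas mutation inside forEach does not.

    case class Doc(id: Long, features: Array[Double], cluster: Int, dist: Double)

    // docs: RDD[Doc]; nearestCluster returns (clusterId, distance) - both assumed.
    val updated = docs.map { d =>
      val (cluster, dist) = nearestCluster(d.features)
      d.copy(cluster = cluster, dist = dist)
    }.cache()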

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
Hierarchical K-means requires a massive number of iterations whereas flat K-means does not, but I've found flat clustering to be generally useless since in most UIs it is nice to be able to drill down into more and more specific clusters. If you have 100 million documents and your branching factor is 8
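As a rough back-of-the-envelope (not from the thread): with a branching factor of 8, covering 100 million documents takes about log_8(10^8) ≈ 8.9, i.e. roughly 9 levels of splitting, and every level runs its own k-means over each node's subset, so the total iteration count multiplies quickly compared to a single flat clustering.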

Re: Modifying an RDD in forEach

2014-12-06 Thread Mayur Rustagi
You'll benefit by viewing Matei's talk at Yahoo on Spark internals and how it optimizes execution of iterative jobs. The simple answer is: 1. Spark doesn't materialize an RDD when you do an iteration, but lazily captures the transformation functions in the RDD (only the function and closure, no data operation
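A tiny illustration of point 1 (assumes an existing SparkContext `sc`): the transformations only record lineage, and no job runs until an action is called.

    val nums    = sc.parallelize(1 to 1000000)
    val doubled = nums.map(_ * 2)             // nothing executes yet
    val evens   = doubled.filter(_ % 4 == 0)  // still nothing
    val total   = evens.count()               // action: a job runs now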

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Very interesting: the line doing drop table throws an exception. After removing it, all works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal
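A hypothetical illustration of the backquote workaround (table layout and location are invented, the exact STORED AS clause depends on the Hive version in use, and `hiveContext` is an assumed HiveContext):

    hiveContext.sql("""
      CREATE EXTERNAL TABLE pmt (`sorted::id` bigint, `sorted::cre_ts` string)
      STORED AS PARQUET
      LOCATION '/path/to/pig/output'
    """)
    hiveContext.sql("SELECT `sorted::cre_ts` FROM pmt LIMIT 1").collect()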

Re: Including data nucleus tools

2014-12-06 Thread spark.dubovsky.jakub
Hi again, I have tried to recompile and run this again with a new assembly created by ./make-distribution.sh -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive -DskipTests It results in exactly the same error. Any other hints? Bonus question: Should the class org.datanucleus.api.jdo.

PySpark Loading Json Followed by groupByKey seems broken in spark 1.1.1

2014-12-06 Thread Brad Willard
When I run a groupByKey it seems to create a single task after the groupByKey that never stops executing. I'm loading a smallish json dataset that is 4 million. This is the code I'm running. rdd = sql_context.jsonFile(uri) rdd = rdd.cache() grouped = rdd.map(lambda row: (row.id,
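A single never-finishing task after groupByKey often points at one huge key or too few shuffle partitions. A generic sketch of the usual mitigations, shown in Scala rather than PySpark (`pairs` is an assumed RDD of key/value pairs):

    // Give the shuffle an explicit partition count so one partition cannot
    // absorb the whole stage.
    val grouped = pairs.groupByKey(200)
    // If only a per-key aggregate is needed, reduceByKey avoids materializing
    // every value for a key in one place.
    val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _, 200)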

Re: cartesian on pyspark not paralleised

2014-12-06 Thread Akhil Das
You could try increasing the level of parallelism (spark.default.parallelism) while creating the SparkContext. Thanks Best Regards On Fri, Dec 5, 2014 at 6:37 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, using pyspark 1.1.0 on YARN 2.5.0. all operations run nicely in parallel - I
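A minimal sketch of setting it at context creation (the value 64 is illustrative; a common rule of thumb is 2-3 tasks per CPU core in the cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cartesian-job")
      .set("spark.default.parallelism", "64")
    val sc = new SparkContext(conf)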

Re: Where can you get nightly builds?

2014-12-06 Thread Ted Yu
See https://amplab.cs.berkeley.edu/jenkins/view/Spark/ See also https://issues.apache.org/jira/browse/SPARK-1517 Cheers On Sat, Dec 6, 2014 at 6:41 AM, Simone Franzini captainfr...@gmail.com wrote: I recently read in the mailing list that there are now nightly builds available. However, I

Is there a way to force spark to use specific ips?

2014-12-06 Thread Ashic Mahtab
Hi, It appears that spark is always attempting to use the driver's hostname to connect / broadcast. This is usually fine, except when the cluster doesn't have DNS configured. For example, in a vagrant cluster with a private network. The workers and masters, and the host (where the driver runs
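One option, sketched here with placeholder addresses: advertise the driver by IP via spark.driver.host, and pin worker/master bind addresses with the SPARK_LOCAL_IP environment variable.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("vagrant-cluster-app")
      .set("spark.driver.host", "192.168.50.1")   // driver's routable IP, not its hostname
    val sc = new SparkContext(conf)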

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Corey Nolet
Reading the documentation a little more closely, I'm using the wrong terminology. I'm using stages to refer to what spark is calling a job. I guess application (more than one spark context) is what I'm asking about On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote: I've read in the

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang
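For context, the statements involved look roughly like this (using an assumed HiveContext named hiveContext; the noscan form and the threshold property are the ones the Spark SQL docs of that era describe for broadcast joins):

    hiveContext.sql("ANALYZE TABLE pmt COMPUTE STATISTICS noscan")
    // Tables whose computed size falls below this many bytes get broadcast:
    hiveContext.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")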

Re: SQL query in scala API

2014-12-06 Thread Arun Luthra
Thanks, I will try this. On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian lian.cs@gmail.com wrote: Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write you own aggregation with aggregateByKey: users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =
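A self-contained version of the same idea (hypothetical data, not Cheng's exact code): per key, count the records and collect the distinct users seen.

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    val users = sc.parallelize(Seq(
      ("page1", "alice"), ("page1", "bob"), ("page1", "alice"), ("page2", "bob")))

    val stats = users.aggregateByKey((0, Set.empty[String]))(
      { case ((count, seen), user) => (count + 1, seen + user) },
      { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 ++ s2) })
    // => ("page1", (3, Set(alice, bob))), ("page2", (1, Set(bob)))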

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Aaron Davidson
You can actually submit multiple jobs to a single SparkContext in different threads. In the case you mentioned with 2 stages having a common parent, both will wait for the parent stage to complete and then the two will execute in parallel, sharing the cluster resources. Solutions that submit
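A minimal sketch of the threaded-submission pattern (path and computations are placeholders; `sc` is an assumed SparkContext):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val shared = sc.textFile("hdfs:///data/input").cache()   // common parent

    val jobA = Future { shared.filter(_.contains("ERROR")).count() }
    val jobB = Future { shared.map(_.length.toLong).reduce(_ + _) }

    val errors     = Await.result(jobA, Duration.Inf)
    val totalChars = Await.result(jobB, Duration.Inf)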

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
Update: It seems the following combo causes things in spark streaming to go missing: spark-core 1.1.0, spark-streaming 1.1.0, spark-cassandra-connector 1.1.0. The moment I add the three together, things like StreamingContext and Seconds are unavailable. sbt assembly fails saying those aren't there.
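One thing worth trying (a guess, not the resolution reached in the thread): pin the Spark artifacts explicitly in build.sbt and mark them provided, so the connector's transitive dependencies cannot evict them during assembly.

    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-core"                % "1.1.0" % "provided",
      "org.apache.spark"   %% "spark-streaming"           % "1.1.0" % "provided",
      "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
    )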

Re: Where can you get nightly builds?

2014-12-06 Thread Nicholas Chammas
To expand on Ted's response, there are currently no nightly builds published for users to use. You can watch SPARK-1517 (which Ted linked to) to be updated when that happens. On Sat Dec 06 2014 at 10:19:10 AM Ted Yu yuzhih...@gmail.com wrote: See

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
Hi, Just checked: cassandra connector 1.1.0-beta1 runs fine. The issue seems to be 1.1.0 for spark streaming and 1.1.0 cassandra connector (final). Regards, Ashic. Date: Sat, 6 Dec 2014 13:52:20 -0500 Subject: Re: Adding Spark Cassandra dependency breaks Spark Streaming? From:

Re: Including data nucleus tools

2014-12-06 Thread Michael Armbrust
On Sat, Dec 6, 2014 at 5:53 AM, spark.dubovsky.ja...@seznam.cz wrote: Bonus question: Should the class org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of the assembly? Because it is not in the jar now. No, these jars cannot be put into the assembly because they have extra metadata files

Re: Modifying an RDD in forEach

2014-12-06 Thread Mohit Jaggi
Ron, “appears to be working” might be true when there are no failures. On large datasets being processed on a large number of machines, failures of several types (server, network, disk, etc.) can happen. At that time, Spark will not “know” that you changed the RDD in-place and will use any version

Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
This is perhaps more of a YARN question than a Spark question, but I was just curious as to how memory is allocated in YARN via the various configurations. For example, if I spin up my cluster with 4GB with a different number of executors as noted below: 4GB executor-memory x 10 executors = 46GB

RE: Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
These are very interesting comments. The vast majority of cases I'm working on are going to be in the 3 million range, and 100 million was thrown out as something to shoot for. I upped it to 500 million. But all things considered, I believe I may be able to directly translate what I have to

Re: Spark on YARN memory utilization

2014-12-06 Thread Arun Ahuja
Hi Denny, This is due to the spark.yarn.memoryOverhead parameter; depending on what version of Spark you are on the default of this may differ, but it should be the larger of 1024MB per executor or .07 * executorMemory. When you set executor memory, the yarn resource request is executorMemory +
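Worked example using the defaults Arun quotes (the overhead default differs across Spark versions, and YARN also rounds each request up to its minimum allocation, so observed totals vary): with --executor-memory 4g, each container request is 4096 MB + max(1024 MB, 0.07 x 4096 MB ≈ 287 MB) = 5120 MB, so 10 executors ask YARN for roughly 50 GB, plus the ApplicationMaster's own container.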

Re: Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
Got it - thanks! On Sat, Dec 6, 2014 at 14:56 Arun Ahuja aahuj...@gmail.com wrote: Hi Denny, This is due to the spark.yarn.memoryOverhead parameter; depending on what version of Spark you are on the default of this may differ, but it should be the larger of 1024mb per executor or .07 *

run JavaAPISuite with maven

2014-12-06 Thread Koert Kuipers
When I run mvn test -pl core, I don't see JavaAPISuite being run. Or if it is, it's being very, very quiet about it. Is this by design?

Re: run JavaAPISuite with maven

2014-12-06 Thread Ted Yu
In the master branch, I only found JavaAPISuite mentioned in comments: spark tyu$ find . -name '*.scala' -exec grep JavaAPISuite {} \; -print * For usage example, see test case JavaAPISuite.testJavaJdbcRDD. * converted into a `Object` array. For usage example, see test case JavaAPISuite.testJavaJdbcRDD.

Re: java.lang.ExceptionInInitializerError/Unable to load YARN support

2014-12-06 Thread maven
I noticed that when I unset HADOOP_CONF_DIR, I'm able to work in the local mode without any errors. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ExceptionInInitializerError-Unable-to-load-YARN-support-tp20560p20561.html Sent from the Apache

Re: run JavaAPISuite with maven

2014-12-06 Thread Ted Yu
Pardon me, the test is here: sql/core/src/test/java/org/apache/spark/sql/api/java/JavaAPISuite.java You can run 'mvn test' under sql/core Cheers On Sat, Dec 6, 2014 at 5:55 PM, Ted Yu yuzhih...@gmail.com wrote: In master branch, I only found JavaAPISuite in comment: spark tyu$ find . -name

Recovered executor num in yarn-client mode

2014-12-06 Thread yuemeng1
Hi, all I have a (maybe clumsy) question about executor recovery num in yarn-client mode. My situation is as follows: We have a 1 (resource manager) + 3 (node manager) cluster, an app is running with one driver on the resource manager and 12 executors on all the node managers, and there are

vcores used in cluster metrics (yarn resource manager ui) when running spark on yarn

2014-12-06 Thread yuemeng1
Hi, all When I run an app with this cmd: ./bin/spark-sql --master yarn-client --num-executors 2 --executor-cores 3, I noticed that the yarn resource manager ui shows the `vcores used` in cluster metrics as 3. It seems `vcores used` shows the wrong number (should it be 7?), or am I missing something?

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Ok, found another possible bug in Hive. My current solution is to use ALTER TABLE CHANGE to rename the column names. The problem is that after renaming the column names, the values of the columns became all NULL. Before renaming: scala> sql("select `sorted::cre_ts` from pmt limit 1").collect res12:
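A hypothetical form of the rename being discussed, using the same sql(...) helper as in the snippet above (the string column type is a guess): in Hive, CHANGE COLUMN only rewrites table metadata, so if the Parquet SerDe resolves columns by name, the renamed column no longer matches the file schema - consistent with the all-NULL values observed.

    sql("ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string")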

Re: run JavaAPISuite with maven

2014-12-06 Thread Koert Kuipers
Ted, i mean core/src/test/java/org/apache/spark/JavaAPISuite.java On Sat, Dec 6, 2014 at 9:27 PM, Ted Yu yuzhih...@gmail.com wrote: Pardon me, the test is here: sql/core/src/test/java/org/apache/spark/sql/api/java/JavaAPISuite.java You can run 'mvn test' under sql/core Cheers On Sat,

Re: run JavaAPISuite with maven

2014-12-06 Thread Ted Yu
I tried to run tests for core but there were failures. e.g.: ExternalAppendOnlyMapSuite: - simple insert - insert with collision - ordering - null keys and values - simple aggregator - simple cogroup Spark assembly has

Re: run JavaAPISuite with maven

2014-12-06 Thread Michael Armbrust
Not sure about maven, but you can run that test with sbt: sbt/sbt sql/test-only org.apache.spark.sql.api.java.JavaAPISuite On Sat, Dec 6, 2014 at 9:59 PM, Ted Yu yuzhih...@gmail.com wrote: I tried to run tests for core but there were failures. e.g.: ExternalAppendOnlyMapSuite:

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String and
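A rough sketch of that JSON-string route (the hand-rolled serializer below only handles strings and numbers and is for illustration only; a real JSON library should do this, and `sc` is an assumed SparkContext):

    import org.apache.spark.sql.SQLContext

    def toJson(m: Map[String, Any]): String =
      m.map {
        case (k, v: String) => s""""$k":"$v""""
        case (k, v)         => s""""$k":$v"""
      }.mkString("{", ",", "}")

    val sqlContext = new SQLContext(sc)
    val maps       = sc.parallelize(Seq(Map("name" -> "a", "count" -> 1)))
    val schemaRdd  = sqlContext.jsonRDD(maps.map(toJson))   // schema is inferred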