Great to hear!
-Sandy
On Fri, Dec 5, 2014 at 11:17 PM, Denny Lee denny.g@gmail.com wrote:
Okay, my bad for not testing out the documented arguments - once I use the
correct ones, the query completes in ~55s (I can probably make it
faster). Thanks for the help, eh?!
On Fri
With that said, and given the nature of the iterative algorithms that Spark is advertised
for, isn't this a bit of an unnecessary restriction? I don't see where the
problem is. For instance, it is clear that when aggregating you need operations
to be associative because of the way they are divided
I guess a major problem with this is that you lose fault tolerance.
You have no way of recreating the local state of the mutable RDD if a
partition is lost.
Why would you need thousands of RDDs for kmeans? It's a few per iteration.
An RDD is more bookkeeping than data structure, itself. They
This is from a separate thread with a differently named title.
Why can't you modify the actual contents of an RDD using forEach? It appears to
be working for me. What I'm doing is changing cluster assignments and distances
per data item for each iteration of the clustering algorithm. The
Hierarchical K-means requires a massive number of iterations, whereas flat K-means
does not, but I've found flat to be generally useless since in most UIs it is
nice to be able to drill down into more and more specific clusters. If you have
100 million documents and your branching factor is 8
You'll benefit by viewing Matei's talk at Yahoo on Spark internals and how
it optimizes execution of iterative jobs.
The simple answer is:
1. Spark doesn't materialize an RDD when you do an iteration, but lazily
captures the transformation functions in the RDD (only the function and closure,
no data operation
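A tiny illustration of point 1 (a spark-shell style sketch where sc is predefined and the numbers are arbitrary): transformations only record the closure to apply, and nothing is computed until an action runs.

val nums    = sc.parallelize(1 to 1000)
val doubled = nums.map(_ * 2)       // lazy: no job is launched here
val total   = doubled.reduce(_ + _) // action: the map and reduce run now, as one job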
Very interesting, the line doing DROP TABLE throws an exception. After
removing it, all works.
Jianshi
On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Here's the solution I got after talking with Liancheng:
1) using backquote `..` to wrap up all illegal
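As a minimal illustration of the backquote trick (a sketch assuming a HiveContext-backed sql function and the pmt table / sorted::cre_ts column that appear later in this digest):

// "::" is not a legal identifier character in HiveQL, so the column name
// must be wrapped in backquotes.
sql("SELECT `sorted::cre_ts` FROM pmt LIMIT 1").collect()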
Hi again,
I have tried to recompile and run this again with new assembly created by
./make-distribution.sh -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.3 -Pyarn -Phive -DskipTests
It results in exactly the same error. Any other hints?
Bonus question: Should the class org.datanucleus.api.jdo.
When I run a groupByKey it seems to create a single task after the
groupByKey that never stops executing. I'm loading a smallish json dataset
that is 4 million. This is the code I'm running.
rdd = sql_context.jsonFile(uri)
rdd = rdd.cache()
grouped = rdd.map(lambda row: (row.id,
You could try increasing the level of parallelism
(spark.default.parallelism) while creating the SparkContext
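A minimal sketch of that suggestion (shown in Scala; the same property can be set from PySpark's SparkConf, and 200 is just a placeholder value):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the default partition count used by wide operations such as groupByKey,
// so the post-shuffle work is spread over more tasks.
val conf = new SparkConf()
  .setAppName("more-parallelism")
  .set("spark.default.parallelism", "200")
val sc = new SparkContext(conf)

Alternatively, a partition count can be passed directly to the wide operation itself (e.g. groupByKey(200)).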
Thanks
Best Regards
On Fri, Dec 5, 2014 at 6:37 PM, Antony Mayi antonym...@yahoo.com.invalid
wrote:
Hi,
using pyspark 1.1.0 on YARN 2.5.0. All operations run nicely in parallel -
I
See https://amplab.cs.berkeley.edu/jenkins/view/Spark/
See also https://issues.apache.org/jira/browse/SPARK-1517
Cheers
On Sat, Dec 6, 2014 at 6:41 AM, Simone Franzini captainfr...@gmail.com
wrote:
I recently read in the mailing list that there are now nightly builds
available. However, I
Hi, It appears that Spark is always attempting to use the driver's hostname to
connect / broadcast. This is usually fine, except when the cluster doesn't have
DNS configured. For example, in a Vagrant cluster with a private network. The
workers and masters, and the host (where the driver runs
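One common workaround, sketched below under the assumption that the driver's routable address is known (the IP is only a placeholder for the Vagrant private-network address), is to advertise an explicit IP instead of relying on hostname resolution:

import org.apache.spark.{SparkConf, SparkContext}

// Tell executors to reach the driver by IP rather than by hostname.
val conf = new SparkConf()
  .setAppName("no-dns-cluster")
  .set("spark.driver.host", "192.168.33.10") // placeholder private-network address
val sc = new SparkContext(conf)

Setting the SPARK_LOCAL_IP environment variable on each node can help on the worker side in the same way.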
Reading the documentation a little more closely, I'm using the wrong
terminology. I'm using stages to refer to what Spark is calling a job. I
guess application (more than one SparkContext) is what I'm asking about
On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote:
I've read in the
Hmm... another issue I found doing this approach is that ANALYZE TABLE ...
COMPUTE STATISTICS will fail to attach the metadata to the table, and later
broadcast join and such will fail...
Any idea how to fix this issue?
Jianshi
On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang
Thanks, I will try this.
On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian lian.cs@gmail.com wrote:
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write
your own aggregation with aggregateByKey:
users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =
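A completed sketch of that approach (spark-shell style, sc predefined; the sample data, and the accumulator being a count plus the set of users seen, are assumptions since the original line is cut off here):

// (key, user) pairs -- hypothetical stand-ins for the real dataset.
val users = sc.parallelize(Seq(("siteA", "u1"), ("siteA", "u2"),
                               ("siteA", "u1"), ("siteB", "u3")))

val perKey = users.aggregateByKey((0, Set.empty[String]))(
  // seqOp: fold one record into the partition-local accumulator
  { case ((count, seen), user) => (count + 1, seen + user) },
  // combOp: merge accumulators from different partitions
  { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 ++ s2) }
)

perKey.collect().foreach { case (key, (count, seen)) =>
  println(s"$key: $count records, ${seen.size} distinct users")
}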
You can actually submit multiple jobs to a single SparkContext in different
threads. In the case you mentioned with 2 stages having a common parent,
both will wait for the parent stage to complete and then the two will
execute in parallel, sharing the cluster resources.
Solutions that submit
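A minimal sketch of that multi-threaded submission pattern (spark-shell style, sc predefined, arbitrary data): two actions launched from separate threads against the same SparkContext, both reading one cached parent.

// The shared parent; cache() so both downstream jobs reuse the same data.
val parent = sc.parallelize(1 to 1000000).map(_ * 2).cache()

val t1 = new Thread(new Runnable {
  def run(): Unit = println("sum   = " + parent.reduce(_ + _))
})
val t2 = new Thread(new Runnable {
  def run(): Unit = println("count = " + parent.filter(_ % 3 == 0).count())
})

// Both jobs run concurrently on the one SparkContext, sharing its resources.
t1.start(); t2.start()
t1.join(); t2.join()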
Update:
It seems the following combo causes things in Spark Streaming to go missing:
spark-core 1.1.0, spark-streaming 1.1.0, spark-cassandra-connector 1.1.0
The moment I add the three together, things like StreamingContext and Seconds
are unavailable. sbt assembly fails saying those aren't there.
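For reference, a build.sbt sketch of the combination described above (the organisation names below are the usual published coordinates; dependency scoping and the sbt-assembly merge strategy are not given in the thread and are left out):

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.1.0",
  "org.apache.spark"   %% "spark-streaming"           % "1.1.0",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)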
To expand on Ted's response, there are currently no nightly builds
published for users to use. You can watch SPARK-1517 (which Ted linked to)
to be updated when that happens.
On Sat Dec 06 2014 at 10:19:10 AM Ted Yu yuzhih...@gmail.com wrote:
See
Hi, Just checked Cassandra connector 1.1.0-beta1 runs fine. The issue seems
to be 1.1.0 for Spark Streaming and the 1.1.0 Cassandra connector (final).
Regards, Ashic.
Date: Sat, 6 Dec 2014 13:52:20 -0500
Subject: Re: Adding Spark Cassandra dependency breaks Spark Streaming?
From:
On Sat, Dec 6, 2014 at 5:53 AM, spark.dubovsky.ja...@seznam.cz wrote:
Bonus question: Should the class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of assembly?
Because it is not in the jar now.
No, these jars cannot be put into the assembly because they have extra
metadata files
Ron,
“appears to be working” might be true when there are no failures. On large
datasets being processed on a large number of machines, failures of several
types (server, network, disk, etc.) can happen. At that time, Spark will not
“know” that you changed the RDD in place and will use any version
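To make the alternative concrete, here is a minimal sketch (spark-shell style; the Point type, the toy data and nearestCenter are all hypothetical): each clustering iteration derives a new RDD of assignments with map instead of mutating records with foreach, so a lost partition can always be rebuilt from the lineage.

case class Point(features: Array[Double])

// Squared-distance nearest-centre lookup; purely illustrative.
def nearestCenter(p: Point, centers: Array[Point]): Int =
  centers.indices.minBy { i =>
    centers(i).features.zip(p.features).map { case (c, x) => (c - x) * (c - x) }.sum
  }

val centers = Array(Point(Array(0.0, 0.0)), Point(Array(4.0, 4.0)))

var assigned = sc.parallelize(Seq(Point(Array(0.1, 0.2)), Point(Array(5.0, 5.0))))
  .map(p => (p, -1)) // dummy initial assignment

// One iteration: build a *new* RDD rather than updating the old one in place.
assigned = assigned.map { case (p, _) => (p, nearestCenter(p, centers)) }.cache()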
This is perhaps more of a YARN question than a Spark question, but I was
just curious about how memory is allocated in YARN via the various
configurations. For example, if I spin up my cluster with 4GB with a
different number of executors as noted below
4GB executor-memory x 10 executors = 46GB
These are very interesting comments. The vast majority of cases I'm working on
are going to be in the 3 million range and 100 million was thrown out as
something to shoot for. I upped it to 500 million. But all things considered,
I believe I may be able to directly translate what I have to
Hi Denny,
This is due to the spark.yarn.memoryOverhead parameter. Depending on what
version of Spark you are on, the default of this may differ, but it should
be the larger of 1024mb per executor or .07 * executorMemory.
When you set executor memory, the YARN resource request is executorMemory +
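As a rough worked example, using the 4GB / 10-executor figures from the question and the rule stated above (exact defaults vary by Spark version, and YARN additionally rounds each request up to its minimum allocation increment, which is not modelled here):

val executorMemoryMb = 4096
val overheadMb       = math.max(1024, (0.07 * executorMemoryMb).toInt) // 1024
val perExecutorMb    = executorMemoryMb + overheadMb                   // 5120
val totalMb          = perExecutorMb * 10                              // 51200, roughly 50GB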
Got it - thanks!
On Sat, Dec 6, 2014 at 14:56 Arun Ahuja aahuj...@gmail.com wrote:
Hi Denny,
This is due to the spark.yarn.memoryOverhead parameter. Depending on what
version of Spark you are on, the default of this may differ, but it should
be the larger of 1024mb per executor or .07 *
When I run mvn test -pl core, I don't see JavaAPISuite being run. Or if it
is, it's being very, very quiet about it. Is this by design?
In master branch, I only found JavaAPISuite in comment:
spark tyu$ find . -name '*.scala' -exec grep JavaAPISuite {} \; -print
* For usage example, see test case JavaAPISuite.testJavaJdbcRDD.
* converted into a `Object` array. For usage example, see test case
JavaAPISuite.testJavaJdbcRDD.
I noticed that when I unset HADOOP_CONF_DIR, I'm able to work in the local
mode without any errors.
Pardon me, the test is here:
sql/core/src/test/java/org/apache/spark/sql/api/java/JavaAPISuite.java
You can run 'mvn test' under sql/core
Cheers
On Sat, Dec 6, 2014 at 5:55 PM, Ted Yu yuzhih...@gmail.com wrote:
In master branch, I only found JavaAPISuite in comment:
spark tyu$ find . -name
Hi, all
I have a (maybe clumsy) question about the executor recovery count in
yarn-client mode. My situation is as follows:
We have a 1 (resource manager) + 3 (node manager) cluster; an app is
running with one driver on the resource manager and 12 executors on all
the node managers,
and there are
Hi, all
When I run an app with this cmd: ./bin/spark-sql --master
yarn-client --num-executors 2 --executor-cores 3, I noticed that the YARN
resource manager UI shows the `vcores used` in cluster metrics as 3. It
seems `vcores used` shows the wrong number (should it be 7?). Or am I missing something?
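(For reference, the 7 the poster expects presumably comes from 2 executors x 3 cores each, plus 1 core for the yarn-client ApplicationMaster: 2 * 3 + 1 = 7.)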
Ok, found another possible bug in Hive.
My current solution is to use ALTER TABLE CHANGE to rename the column names.
The problem is that after renaming the column names, the values of the columns
all became NULL.
Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12:
Ted,
I mean
core/src/test/java/org/apache/spark/JavaAPISuite.java
On Sat, Dec 6, 2014 at 9:27 PM, Ted Yu yuzhih...@gmail.com wrote:
Pardon me, the test is here:
sql/core/src/test/java/org/apache/spark/sql/api/java/JavaAPISuite.java
You can run 'mvn test' under sql/core
Cheers
On Sat,
I tried to run tests for core but there were failures, e.g.:
ExternalAppendOnlyMapSuite:
- simple insert
- insert with collision
- ordering
- null keys and values
- simple aggregator
- simple cogroup
Spark assembly has
Not sure about maven, but you can run that test with sbt:
sbt/sbt sql/test-only org.apache.spark.sql.api.java.JavaAPISuite
On Sat, Dec 6, 2014 at 9:59 PM, Ted Yu yuzhih...@gmail.com wrote:
I tried to run tests for core but there were failures, e.g.:
ExternalAppendOnlyMapSuite:
Hmm..
I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782
Jianshi
On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD?
I'm currently converting each Map to a JSON String and
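A sketch of that intermediate-JSON route (spark-shell style, sc predefined; JSONObject from the Scala standard library is used purely for illustration, since the thread does not say which JSON serializer was used):

import org.apache.spark.sql.SQLContext
import scala.util.parsing.json.JSONObject

val sqlContext = new SQLContext(sc)

// Hypothetical stand-in for the real RDD[Map[String, Any]].
val maps = sc.parallelize(Seq(
  Map[String, Any]("name" -> "a", "count" -> 1),
  Map[String, Any]("name" -> "b", "count" -> 2)))

// Serialize each Map to a JSON string and let jsonRDD infer the schema.
val jsonStrings = maps.map(m => JSONObject(m).toString())
val schemaRDD   = sqlContext.jsonRDD(jsonStrings)
schemaRDD.printSchema()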