When I add
parts(0).collect().foreach(println)
parts(1).collect().foreach(println), for printing parts, I get the
following error
not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit bf: scala.collection.generic.CanBuildFrom[String,B,That])That.
Unspecified value
Hello everyone,
I am porting a clustering algorithm to the Spark platform, and I have met a problem that has been confusing me for a long time. Can someone help me?
I have a PairRDD<Integer, Integer> named patternRDD, in which the key represents a number and the value stores information about the key. And I
Hi all,
My job is CPU intensive, and its resource configuration is 400 workers * 1 core * 3 GB. There are many fetch failures, like:
14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave1:33500)
14-08-23 08:34:52 INFO
\cc David Tompkins and Jim Donahue if they have anything to add.
\cc My school email. Please include bamos_cmu.edu for further discussion.
Hi Deb,
Debasish Das wrote
Looks very cool...will try it out for ad-hoc analysis of our datasets and
provide more feedback...
Could you please give bit
Hi all,
Has anyone tried to pipe an RDD into a MATLAB script? I'm trying to do something similar; could one of you point me to some hints?
Best regards,
Jao
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote:
When I add
parts(0).collect().foreach(println)
parts(1).collect().foreach(println), for printing parts, I get the following
error
not enough arguments for method collect: (pf:
PartialFunction[Char,B])(implicit
Hi all,
When I run Spark applications, I see from the web UI that some stage descriptions are like "apply at Option.scala:120".
Why does Spark split a stage on a line that is not in my Spark program but in a Scala library?
Thanks
Jensen
Hi,
Can someone help me with the following error:
scala> val rdd = sc.parallelize(Array(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> rdd.persist(StorageLevel.MEMORY_ONLY)
<console>:15: error: not found: value StorageLevel
You need to import StorageLevel:
import org.apache.spark.storage.StorageLevel
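Putting it together, a minimal sketch of the working sequence in the shell (sc is the usual shell SparkContext):
import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(Array(1, 2, 3, 4))
rdd.persist(StorageLevel.MEMORY_ONLY)  // resolves now that StorageLevel is imported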
taoist...@gmail.com
From: rapelly kartheek
Date: 2014-08-25 18:22
To: user
Subject: StorageLevel error.
Hi,
Can someone help me with the following error:
scala> val rdd = sc.parallelize(Array(1,2,3,4))
rdd:
Hi,
Thanks for your help the other day. I had one more question regarding the
same.
If you want to issue an SQL statement on streaming data, you must have both
the registerAsTable() and the sql() call *within* the foreachRDD(...) block,
or -- as you experienced -- the table name will be unknown
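For reference, a minimal sketch of that pattern, assuming an existing SQLContext named sqlContext and a DStream of a hypothetical case class Record (all names here are illustrative):
case class Record(word: String, n: Int)
// records: DStream[Record]
records.foreachRDD { rdd =>
  import sqlContext.createSchemaRDD      // implicit conversion RDD[Record] -> SchemaRDD
  rdd.registerAsTable("records")         // register inside foreachRDD, as noted above
  val frequent = sqlContext.sql("SELECT word, n FROM records WHERE n > 1")
  frequent.collect().foreach(println)
}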
Hi,
I'm working on big graph analytics, and currently implementing a mean field
inference algorithm in GraphX/Spark. I start with an arbitrary graph, keep a
(sparse) probability distribution at each node implemented as a
Map[Long,Double]. At each iteration, from the current estimates of the
Hi Guys,
I am currently playing with huge data. I have an RDD of type RDD[List[(tuples)]]. I need only the tuples to be written to the text file output using the saveAsTextFile function.
Example: val mod = modify.saveAsTextFile() returns
Hello all,
Could someone help me with the manipulation of CSV file data? I have semicolon-separated CSV data including doubles and strings. I want to calculate the maximum/average of a column. When I read the file using sc.textFile("test.csv").map(_.split(";")), each field is read as a string.
On Thu, Aug 21, 2014 at 6:21 PM, pierred pie...@demartines.com wrote:
So, what is the accepted wisdom in terms of IDE and development environment?
I don't know what the accepted wisdom is. I've been getting by with the
Scala IDE for Eclipse, though I am using the stable version - as you noted,
Hi Patrick,
For the spilling-within-one-key work that you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823, or is there another ticket I should be following?
Thanks!
Andrew
On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com
Do you want to do this on one column or all numeric columns?
On Mon, Aug 25, 2014 at 7:09 AM, Hingorani, Vineet vineet.hingor...@sap.com
wrote:
Hello all,
Could someone help me with the manipulation of csv file data. I have
'semicolon' separated csv data including doubles and strings. I
Hey Andrew,
We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRAs for the major 1.2 issues at the beginning of September.
- Patrick
On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash and...@andrewash.com wrote:
Hi Patrick,
For the spilling-within-one-key work you mention
Hello Victor,
I want to do it on multiple columns. I was able to do it on one column with the help of Sean using the code below.
val matData = file.map(_.split(";"))
val stats = matData.map(_(2).toDouble).stats()
stats.mean
stats.max
Thank you
Vineet
From: Victor Tso-Guillen
Hello All,
I have added a jar from an S3 instance to the classpath; I have tried the following options:
1. sc.addJar("s3n://mybucket/lib/myUDF.jar")
2. hiveContext.sparkContext.addJar("s3n://mybucket/lib/myUDF.jar")
3. ./bin/spark-shell --jars s3n://mybucket/lib/myUDF.jar
I am getting a ClassNotFoundException when
I was able to get JavaWordCount running with a local instance under
IntelliJ.
In order to do so I needed to use Maven to package my code and call
String[] jars = { "/SparkExamples/target/word-count-examples_2.10-1.0.0.jar" };
sparkConf.setJars(jars);
After that the sample ran properly and
flatMap() is a transformation only. Calling it by itself does nothing,
and it just describes the relationship between one RDD and another.
You should see it swing into action if you invoke an action, like
count(), on the words RDD.
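A minimal illustration of that laziness, assuming lines is an RDD[String] (names are illustrative):
val words = lines.flatMap(line => line.split(" "))  // transformation: nothing runs yet
val n = words.count()                               // action: this triggers the computation
println(n)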
On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis
That was not quite in English.
My flatMap code is shown below. I know the code is called since the answers are correct, but I would like to put a breakpoint in dropNonLetters to make sure that code works properly. I am running in the IntelliJ debugger but believe the code is executing on a
Hi Dibyendu,
My colleague has taken a look at the Spark Kafka consumer GitHub repo you have provided and started experimenting.
We found that somehow when Spark has a failure after a data checkpoint, the expected re-computations corresponding to the metadata checkpoints are not recovered, so we lose
Hi,
I am exploring the GraphX library and trying to determine which use cases make the most sense for/with it. From what I initially thought, it looked like
GraphX could be applied to data stored in RDBMSs as Spark could translate
the relational data into graphical representation. However, there seems to
At 2014-08-25 06:41:36 -0700, BertrandR bertrand.rondepierre...@gmail.com
wrote:
Unfortunately, this works well for extremely small graphs, but it becomes
exponentially slow with the size of the graph and the number of iterations
(doesn't finish 20 iterations with graphs having 48000 edges).
At 2014-08-25 11:23:37 -0700, Sunita Arvind sunitarv...@gmail.com wrote:
Does this "We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in
In general all PRs should be made against master. When necessary, we can
back port them to the 1.1 branch as well. However, since we are in
code-freeze for that branch, we'll only do that for major bug fixes at this
point.
On Thu, Aug 21, 2014 at 10:58 AM, Dmitriy Lyubimov dlie...@gmail.com
I am running a spark job on ~ 124 GB of data in a S3 bucket. The Job runs
fine but occasionally returns the following exception during the first map
stage which involves reading and transforming the data from S3. Is there a
config parameter I can set to increase this timeout limit?
*14/08/23
Have you tried the pipe() operator? It should work if you can launch your
script from the command line. Just watch out for any environment variables
needed (you can pass them to pipe() as an optional argument if there are some).
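A minimal sketch of that usage, assuming rdd is an RDD[String] and /path/to/script.sh reads lines from stdin and writes results to stdout (the path and the variable name are illustrative):
val piped = rdd.pipe(
  Seq("/path/to/script.sh"),
  Map("MY_ENV_VAR" -> "some-value"))  // optional environment variables for the script
piped.collect().foreach(println)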
On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa
Just like with normal Spark Jobs, that command returns an RDD that contains
the lineage for computing the answer but does not actually compute the
answer. You'll need to run collect() on the RDD in order to get the result.
On Mon, Aug 25, 2014 at 11:46 AM, S Malligarjunan
Which version of Spark SQL are you using? Several issues with custom hive
UDFs have been fixed in 1.1.
On Mon, Aug 25, 2014 at 9:57 AM, S Malligarjunan
smalligarju...@yahoo.com.invalid wrote:
Hello All,
I have added a jar from S3 instance into classpath, i have tried following
options
1.
In our case, the row has about 80 columns, which exceeds the case class limit (22 fields in Scala 2.10).
Starting with Spark 1.1 you'll be able to also use the applySchema API
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L126
.
One useful thing to do when you run into unexpected slowness is to run
'jstack' a few times on the driver and executors and see if there is any
particular hotspot in the Spark SQL code.
Also, it seems like a better option here might be to use the new
applySchema API
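A minimal sketch of that API as of Spark 1.1, assuming an existing SQLContext named sqlContext and an RDD[String] of comma-separated lines named lines (all names are illustrative):
import org.apache.spark.sql._
// Build the schema programmatically instead of through a case class,
// so the 22-field case class limit does not apply.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true)))
val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("people")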
This should be fixed in the latest Spark. What branch are you running?
2014-08-25 1:32 GMT-07:00 Wang, Jensen jensen.w...@sap.com:
Hi, All
When I run spark applications, I see from the web-ui that some
stage description are like “apply at Option.scala:120”.
Why spark splits a
Thanks for this very thorough write-up and for continuing to update it as
you progress! As I said in the other thread it would be great to do a
little profiling to see if we can get to the heart of the slowness with
nested case classes (very little optimization has been done in this code
path).
Hi John,
I tried to follow your description but failed to reproduce this issue.
Would you mind providing some more details? Especially:
- The exact Git commit hash of the snapshot version you were using
  (mine: e0f946265b9ea5bc48849cf7794c2c03d5e29fba)
You could try to use foreachRDD on the result of countByWindow with a
function that performs the save operation.
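A minimal sketch of that approach, assuming a DStream[String] named lines, a 30-second window, a 10-second slide, and an illustrative output path (countByWindow also requires checkpointing to be enabled):
val counts = lines.countByWindow(Seconds(30), Seconds(10))
counts.foreachRDD { (rdd, time) =>
  // save each window's count together with the batch timestamp
  rdd.saveAsTextFile("window-counts-" + time.milliseconds)
}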
On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote:
Hi,
Hopefully a simple question. Though is there an example of where to save
the output of countByWindow ? I
So I tried the above (why doesn't union or ++ have the same behavior, btw?)
I don't think there is a good reason for this. I'd open a JIRA.
and it works, but is slow because the original RDDs are not cached and files must be read from disk.
I also discovered you can recover the
Hi Guys,
I just want to know whether there is any way to determine which file is being handled by Spark from a group of files given as input inside a directory. Suppose I have 1000 input files; I want to determine which file is currently being handled by the Spark program so that if any error
In general master should be a superset of what is in any of the release
branches. In the particular case of Spark SQL master and branch-1.1 should
be identical (though that will likely change once Patrick cuts the first
RC).
On Mon, Aug 25, 2014 at 12:50 PM, Dmitriy Lyubimov dlie...@gmail.com
PS from an offline exchange -- yes more is being called here, the rest
is the standard WordCount example.
The trick was to make sure the task executes locally, and calling setMaster("local") on SparkConf in the example code does that. That seems to work fine in IntelliJ for debugging this.
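A minimal sketch of that setup (names are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("local")   // run everything in the IDE's JVM so breakpoints are hit
val sc = new SparkContext(conf)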
On Mon,
- dev list
+ user list
You should be able to query Spark SQL using JDBC, starting with the 1.1
release. There is some documentation in the repo
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server,
and we'll update the official docs once the
https://spark.apache.org/screencasts/1-first-steps-with-spark.html
The embedded YouTube video shows up in Safari on OS X but not in Chrome.
How come?
Nick
Hi Du,
I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of SPARK-2420, and it's already resolved in Spark 1.1.0. It looks like SPARK-2420 will be released in Spark 1.2.0 according to the current JIRA status.
I'm tracking branch-1.1 instead of the master and haven't seen the
Hi,
I created an instance of LocalHiveContext and attempted to create a database. However, it failed with the message:
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.RuntimeException: Unable to
Hi,
Does anyone know whether Spark Streaming counts the number of windows processed? I am trying to keep a record of the results of processed windows and the corresponding timestamps, but I cannot find any related documents or examples.
Thanks,
-JC
I like this consumer for what it promises - better control over offset and
recovery from failures. If I understand this right, it still uses a single
worker process to read from Kafka (one thread per partition) - is there a
way to specify multiple worker processes (on different machines) to read
Hi,
On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 praveshjain1...@gmail.com
wrote:
If you want to issue an SQL statement on streaming data, you must have
both
the registerAsTable() and the sql() call *within* the foreachRDD(...)
block,
or -- as you experienced -- the table name will be
Hi again,
On Tue, Aug 26, 2014 at 10:13 AM, Tobias Pfeiffer t...@preferred.jp wrote:
On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991
praveshjain1...@gmail.com wrote:
If you want to issue an SQL statement on streaming data, you must have
both
the registerAsTable() and the sql() call
Hi,
I am able to access the Application details web page from the master UI page
when I run Spark in standalone mode on my local machine. However, I am not
able to access it when I run Spark on our private cluster. The Spark master
runs on one of the nodes in the cluster. I am able to access the
You can try to manipulate the string you want to output before saveAsTextFile, something like:
modify.flatMap(x => x).map { x =>
  val s = x.toString
  s.subSequence(1, s.length - 1)
}
There should be a more optimized way.
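For instance, a minimal end-to-end sketch, assuming modify is an RDD[List[(Int, Int)]] and the output path is illustrative:
modify.flatMap(identity)                  // RDD[(Int, Int)]: drop the List wrapper
  .map { case (a, b) => a + "," + b }     // format each tuple without parentheses
  .saveAsTextFile("tuples-output")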
Best Regards,
Raymond Liu
-Original Message-
From: yh18190
It seems to be because you went there with https:// instead of http://. That
said, we'll fix it so that it works on both protocols.
Matei
On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com)
wrote:
https://spark.apache.org/screencasts/1-first-steps-with-spark.html
The
I'm currently creating a subgraph using the vertex predicate:
subgraph(vpred = (vid, attr) => attr.split(",")(2) != "999")
but I'm wondering if a subgraph can be created using the edge predicate; if so, a sample would be great :)
thanks
Dave
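A minimal sketch of the edge-predicate variant, assuming a Graph[String, String] named graph whose edge attributes are comma-separated strings like the vertex attributes above (field index and value are illustrative):
val sub = graph.subgraph(
  epred = triplet => triplet.attr.split(",")(2) != "999",
  vpred = (vid, attr) => attr.split(",")(2) != "999")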
Can you paste the code? It's unclear to me how/when the out of memory is
occurring without seeing the code.
On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote:
Hello everyone,
I am transplanting a clustering algorithm to spark platform, and I
meet a problem
Assuming the CSV is well-formed (every row has the same number of columns)
and every column is a number, this is how you can do it. You can adjust so
that you pick just the columns you want, of course, by mapping each row to
a new Array that contains just the column values you want. Just be sure
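A minimal sketch along those lines, assuming every field is numeric and the file name is illustrative (it makes one pass over the data per column, which is simple rather than optimal):
val rows = sc.textFile("test.csv").map(_.split(";").map(_.toDouble))
val numCols = rows.first().length
(0 until numCols).foreach { i =>
  val stats = rows.map(_(i)).stats()   // StatCounter: count, mean, stdev, max, min
  println("column " + i + ": mean=" + stats.mean + " max=" + stats.max)
}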
Hi Jonathan,
Thanks for the reply. I ran other exercises (movie recommendation and GraphX)
on the same cluster and did not see these errors. So I think this might not be
related to the memory setting..
Thanks,
Forest
On Aug 24, 2014, at 10:27 AM, Jonathan Haddad j...@jonhaddad.com wrote:
https://spark.apache.org/screencasts/1-first-steps-with-spark.html
The embedded YouTube video shows up in Safari on OS X but not in Chrome.
I'm using Chrome 36.0.1985.143 on MacOS 10.9.4 and it works like a charm for me.
Cheers,
Michael
--
Michael Hausenblas
Ireland,