There is a PR to fix this: https://github.com/apache/spark/pull/1802
On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
I concur that printSchema works; the trouble seems to arise only with operations that
actually use the data.
Thanks for posting the bug.
-Brad
Hi,
I'm running Spark on an EMR cluster and I'm able to read from S3 using the REPL
without problems:
val input_file = "s3://bucket-name/test_data.txt"
val rawdata = sc.textFile(input_file)
val test = rawdata.collect
but when I try to run a simple standalone application reading the same data,
I
I'm getting the same "Input path does not exist" error even after setting the
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
the format s3://bucket-name/test_data.txt for the input file.
Try s3n://
On Aug 6, 2014, at 12:22 AM, sparkuser2345 hm.spark.u...@gmail.com wrote:
I'm getting the same "Input path does not exist" error even after setting the
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
the format s3://bucket-name/test_data.txt for the
Hi Andrew,
Thank you very much for your solution, which works like a charm, and for the very
clear explanation.
Grzegorz
1) I don't use spark-submit to run my program; I use the Spark context directly:
val conf = new SparkConf()
  .setMaster("spark://123d101suse11sp3:7077")
  .setAppName("LBFGS")
  .set("spark.executor.memory", "30g")
  .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)
Hi,
I am trying to compute beliefs for a graph using the GraphX Pregel implementation.
My use case is: if vertices 2, 3, 4 send messages m2, m3, m4 to vertex
6, then at vertex 6 I will multiply all the messages (m2*m3*m4) = m6, and then
from vertex 6 the message m6/m2 will be sent to vertex 2, m6/m3 to
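For readers unfamiliar with the Pregel API, here is a rough sketch of the multiply-and-forward pattern described above; the Graph[Double, Double] types and the send logic are assumptions for illustration, and dividing out the reverse message (m6/m2) is omitted:

import org.apache.spark.graphx._

// graph: vertex values hold the current beliefs (assumed Graph[Double, Double]).
def propagate(graph: Graph[Double, Double], maxIters: Int): Graph[Double, Double] =
  graph.pregel(1.0, maxIters)(
    (id, belief, msg) => belief * msg,                      // fold merged messages into the belief
    triplet => Iterator((triplet.dstId, triplet.srcAttr)),  // send this vertex's belief to its neighbor
    (m1, m2) => m1 * m2                                     // merge incoming messages by multiplication
  )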
Hi all,
My name is Andres and I'm starting to use Apache Spark.
I'm trying to submit my spark.jar to my cluster using this:
spark-submit --class net.redborder.spark.RedBorderApplication --master
spark://pablo02:7077 redborder-spark-selfcontained.jar
But when I did it, my worker died, and my
Looks like a Netty conflict there; most likely you have multiple
versions of the Netty jars (e.g.
netty-3.6.6.Final.jar, netty-3.2.2.Final.jar, netty-all-4.0.13.Final.jar).
You only require 3.6.6, I believe. A quick fix would be to remove the rest
of them.
Thanks
Best Regards
On Wed, Aug 6, 2014
Evan R. Sparks wrote
Try s3n://
Thanks, that works! In the REPL, I can successfully load the data using both
s3:// and s3n://, so why the difference?
Spark reported a java.lang.IllegalArgumentException with the message:
java.lang.IllegalArgumentException: requirement failed: Found fields with
the same name.
at scala.Predef$.require(Predef.scala:233)
at
org.apache.spark.sql.catalyst.types.StructType.init(dataTypes.scala:317)
Hello,
I have referred to the link https://github.com/dibbhatt/kafka-spark-consumer and
I have successfully consumed tuples from Kafka.
The tuples are JSON objects and I want to store those objects in HDFS in Parquet
format.
Please suggest a sample example for that.
Thanks in advance.
On Tue, Aug
Hi Vida,
It's possible to save an RDD as a Hadoop file using Hadoop output formats. It
might be worthwhile to investigate using DBOutputFormat and see if this will
work for you.
I haven't personally written to a DB, but I'd imagine this would be one way
to do it.
Thanks,
Ron
See here: https://wiki.apache.org/hadoop/AmazonS3
s3:// refers to a block storage system and is deprecated (see
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html).
Use s3n:// for regular files that you can see in the S3 web console.
Nick
On Wed, Aug 6, 2014 at
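For example, in a standalone app the same data can be read with an s3n:// path roughly like this sketch; the bucket name is the placeholder from the question, and the credentials are passed through the Hadoop configuration rather than relying on the environment variables alone:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3ReadTest"))

// Hand the AWS credentials to the s3n filesystem explicitly.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val rawdata = sc.textFile("s3n://bucket-name/test_data.txt")
println(rawdata.count())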
Hi folks, hoping someone who works with Streaming can help me out.
I have the following snippet:
val stateDstream =
  data.map(x => (x, 1))
      .updateStateByKey[State](updateFunc)
stateDstream.saveAsTextFiles(checkpointDirectory, "partitions_test")
where data is an RDD of a
case class
Nice catch Brad and thanks to Yin and Davies for getting on it so quickly.
On Wed, Aug 6, 2014 at 2:45 AM, Davies Liu dav...@databricks.com wrote:
There is a PR to fix this: https://github.com/apache/spark/pull/1802
On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller bmill...@eecs.berkeley.edu
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver?
I think the problem is with the feature dimension. KDD data has more than
20M features and in v1.0.1, the driver collects the partial gradients one
by one, sums them up, does the update, and then sends the new weights back
Hi Vida,
I am writing to a DB -- or trying to :).
I believe the best practice for this (you can search the mailing list
archives) is to combine mapPartitions with a grouped
iterator.
Look at this thread, esp. the comment from A. Boisvert and Matei's comment
above it:
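For illustration, a rough sketch of that pattern; the JDBC URL, table, batch size, and the assumed RDD[(String, Int)] record type are all placeholders:

import java.sql.DriverManager

// records is assumed to be an RDD[(String, Int)]. One connection per partition,
// rows written in batches of 500 via a grouped iterator.
records.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "password")
  val stmt = conn.prepareStatement("INSERT INTO results (key, value) VALUES (?, ?)")
  try {
    rows.grouped(500).foreach { batch =>
      batch.foreach { case (k, v) =>
        stmt.setString(1, k)
        stmt.setInt(2, v)
        stmt.addBatch()
      }
      stmt.executeBatch()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}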
Hello,
I'm interested in getting started with Spark to scale our scientific analysis
package (http://pynbody.github.io) to larger data sets. The package is written
in Python and makes heavy use of numpy/scipy and related frameworks. I've got a
couple of questions that I have not been able to
My company is leaning towards moving much of its analytics work from our
own Spark/Mesos/HDFS/Cassandra setup to Redshift. To date, I have been
the internal advocate for using Spark for analytics, but a number of good
points have been brought up to me. The reasons being pushed are:
-
Thanks. This worked :). I am thinking I should add this to spark-env.sh so
that spark-shell always connects to the master by default.
On Aug 6, 2014 12:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can always start your spark-shell by specifying the master as
1) We get tooling out of the box from Redshift (specifically, stable JDBC
access); with Spark we are often waiting for devops to get the right combo of
tools working or for libraries to support sequence files.
The arguments about JDBC access and simpler setup definitely make sense. My
first
NumPy arrays can only hold basic types, so we cannot use them during collect()
by default.
Could you give a short example of how NumPy arrays are used in your project?
On Wed, Aug 6, 2014 at 8:41 AM, Rok Roskar rokros...@gmail.com wrote:
Hello,
I'm interested in getting started with Spark
Update: I can get it to work by disabling iptables temporarily. However, I
cannot figure out on which port I have to accept traffic. 4040 and any
of the Master or Worker ports mentioned in the previous post don't work.
Can it be one of the randomly assigned ones in the 30k to 60k range? Those
Just to point out that the benchmark you point to has Redshift running on HDD
machines instead of SSD, and it is still faster than Shark in all but one case.
Like Gary, I'm also interested in replacing something we have on Redshift with
Spark SQL, as it will give me much greater capability to
Hi Andres,
If you're using the EC2 scripts to start your standalone cluster, you can
use ~/spark-ec2/copy-dir --delete ~/spark to sync your jars across the
cluster. Note that you will need to restart the Master and the Workers
afterwards through sbin/stop-all.sh and sbin/start-all.sh. If you're
Hi,
I'm having the exact same problem - I'm on a VPN and I'm trying to set the
properties spark.httpBroadcast.uri and spark.fileserver.uri so that they
bind to my VPN IP instead of my regular network IP. Were you ever able to
get this working?
Cheers,
-Rob
Hi Simon,
The drivers and executors currently choose random ports to talk to each
other, so the Spark nodes will have to have full TCP access to each other.
This is changed in a very recent commit, where all of these random ports
will become configurable:
I want to use PageRank on a 3GB text file, which contains a bipartite list
with variables id and brand.
Example:
id,brand
86246,15343
86246,27873
86246,14647
86246,55172
86246,3293
86246,2820
86246,3830
86246,2820
86246,5603
86246,72482
To perform the PageRank I have to create a graph object,
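A rough sketch of building the graph and running PageRank; the file path and header handling are assumptions, and it also assumes the id and brand values do not collide as vertex IDs:

import org.apache.spark.graphx.{Edge, Graph}

// Parse "id,brand" lines into edges, skipping the header row.
val edges = sc.textFile("hdfs:///path/to/id_brand.csv")
  .filter(line => line.nonEmpty && !line.startsWith("id,"))
  .map { line =>
    val Array(id, brand) = line.split(",")
    Edge(id.toLong, brand.toLong, 1)
  }

// Build the graph (1 is the default vertex attribute) and run PageRank.
val graph = Graph.fromEdges(edges, 1)
val ranks = graph.pageRank(0.001).vertices
ranks.take(10).foreach(println)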
Hi Gary,
This has indeed been a limitation of Spark, in that drivers and executors
use random ephemeral ports to talk to each other. If you are submitting a
Spark job from your local machine in client mode (meaning, the driver runs
on your machine), you will need to open up all TCP ports from
This will be awesome - it's been one of the major issues for our analytics
team as they hope to use their own python libraries.
On Wed, Aug 6, 2014 at 2:40 PM, Andrew Or and...@databricks.com wrote:
Hi Gary,
This has indeed been a limitation of Spark, in that drivers and executors
use
Finally I made it work. The trick was the asSubclass method:
val mongoRDD = sc.newAPIHadoopFile("file:///root/jobs/dump/input.bson",
  classOf[BSONFileInputFormat].asSubclass(
    classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]),
  classOf[Object], classOf[BSONObject],
Hi Andrew,
for this test I only have one machine, which provides the master and the only
worker.
So all I'd need is communication to the Internet to access the Twitter API.
I've tried assigning a specific port to the driver and creating iptables rules
for this port, but that didn't work.
Best
I'm sure this must be a fairly common use case for Spark, yet I have not
found a satisfactory discussion of it on the Spark website or forum:
I work at a company with a lot of previous-generation server hardware
sitting idle; I want to add this hardware to my Spark cluster to increase
I have a few questions about managing Spark memory:
1) In a standalone setup, is there any CPU prioritization across users
running jobs? If so, what is the behavior here?
2) With Spark 1.1, users will more easily be able to run drivers/shells
from remote locations that do not cause firewall
Hi,
I am trying to build the jars using the command:
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
Execution of the above command is throwing the following error:
[INFO] Spark Project Core . FAILURE [ 0.295 s]
[INFO] Spark Project Bagel
Thank you TD,
I have worked around that problem and now the test compiles.
However, I don't actually see that test running. When I do mvn test, it
just says BUILD SUCCESS, without any TEST section on stdout.
Are we supposed to use mvn test to run the test? Are there any other
methods that can be
Forgot to cc the mailing list :)
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Agreed. Being able to use SQL to make a table, pass it to a graph
algorithm, pass that output to a machine learning algorithm, being able to
invoke user defined python
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Mostly I was just objecting to "Redshift does very well, but Shark is on
par or better than it in most of the tests" when that was not how I read
the results, and Redshift was on HDDs.
My bad. You are
Also, regarding something like Redshift not having MLlib built in, much of
that could be done on the derived results.
On Aug 6, 2014 4:07 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Mostly I was
The method
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
takes a DStream[(K, V)] and produces a DStream[(K, S)] in Spark Streaming.
We have an input DStream[(K, V)] that has 40,000 elements. We update on average
1,000 of them in every 3
Well yes, MLlib-like routines or pretty much anything else could be run on the
derived results, but you have to unload the results from Redshift and then load
them into some other tool. So it's nicer to leave them in memory and operate on
them there. Major architectural advantage to Spark.
Ron
Yeah, I have seen this design problem come up a number of times
on the mailing list. Tobias, what you linked to is an additional
problem that occurs with mapPartitions, but not with foreachPartition
(which is relevant here). But I do get your point. I think there was an
attempt
On Wed, Aug 6, 2014 at 4:30 PM, Daniel, Ronald (ELS-SDG)
r.dan...@elsevier.com wrote:
Major architectural advantage to Spark.
Amen to that. For a really cool and succinct demonstration of this, check
out Aaron's demo http://youtu.be/sPhyePwo7FA?t=10m16s at the Hadoop
Summit earlier this year.
The output of lapply and lapplyPartition should be the same by design. The
only difference is that in lapply the user-defined function returns a row,
while in lapplyPartition it returns a list.
Could you give an example of a small input and the output that you expect to
see for the above program?
Why can't you keep a persistent queue of S3 files to process? The process
that you are running has two threads.
Thread 1: Continuously gets SQS messages and writes to the queue. This queue
is persisted to reliable storage, like HDFS / S3.
Thread 2: Peeks into the queue, and whenever there is any
Depending on the density of your keys, the alternative signature
def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)]
at least iterates by key rather than
Hello Venkat,
Your thoughts are quite spot on. The current implementation was designed to
allow the functionality of timing out a state. For this to be possible, the
update function needs to be called for each key even if there is no new data, so
that the function can check things like the last update
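For illustration, a rough sketch of an update function with such a timeout; the State case class, the 15-minute threshold, and the pairs DStream[(String, Int)] are all assumptions:

import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

case class State(count: Long, lastUpdated: Long)
val timeoutMs = 15 * 60 * 1000L

val updateFunc = (values: Seq[Int], state: Option[State]) => {
  val now = System.currentTimeMillis()
  if (values.nonEmpty) {
    // New data for this key: fold it in and refresh the timestamp.
    Some(State(state.map(_.count).getOrElse(0L) + values.sum, now))
  } else state match {
    // No new data: drop the state once it has been idle past the timeout.
    case Some(s) if now - s.lastUpdated > timeoutMs => None
    case other => other
  }
}

// Requires ssc.checkpoint(...) to be set; pairs is the assumed input DStream.
val stateDstream = pairs.updateStateByKey[State](updateFunc)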
Hey Gary,
The answer to both of your questions is that much of it is up to the
application.
For (1), the standalone master can set spark.deploy.defaultCores to limit
the number of cores each application can grab. However, the application can
override this with the application-specific
Does it not show the name of the test suite on stdout, showing that it has
passed? Can you try writing a small unit test, in the same way as
your Kafka unit test, and with print statements on stdout ... to see
whether it works? I believe it is some configuration issue in Maven, which
is hard
I'm running on Spark 1.0.0 and I see a similar problem when using the
socketTextStream receiver. The ReceiverTracker task sticks around after an
ssc.stop(false).
You can use Spark SQL for that very easily. You can take the RDDs you get
from the Kafka input stream, convert them to RDDs of case classes, and save them as
Parquet files.
More information here:
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
On Wed, Aug 6, 2014 at 5:23
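A rough sketch of that approach with the Spark 1.0/1.1 SQL API; the Event case class, its fields, the parseEvent helper, and the output path are placeholders, and kafkaStream is assumed to be a DStream[(String, String)] such as the one returned by KafkaUtils.createStream:

import org.apache.spark.sql.SQLContext

case class Event(id: String, value: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly turns an RDD of case classes into a SchemaRDD

kafkaStream.foreachRDD { rdd =>
  // parseEvent is a hypothetical JSON-to-Event parser of your choice.
  val events = rdd.map { case (_, json) => parseEvent(json) }
  events.saveAsParquetFile("hdfs:///data/events/" + System.currentTimeMillis())
}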
You probably have only 10 cores in your cluster on which you are executing
your job. Since each DStream receiver takes one core, the system is
not able to start all of them, and so everything is blocked.
On Wed, Aug 6, 2014 at 3:08 AM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid
wrote:
Hi,
Can you give the stack trace?
This was the fix for the Twitter stream.
https://github.com/apache/spark/pull/1577/files
You could try doing the same.
TD
On Wed, Aug 6, 2014 at 2:41 PM, lbustelo g...@bustelos.com wrote:
I'm running on spark 1.0.0 and I see a similar problem when using the
Hi,
I have a DStream called eventData and it contains a set of Data objects
defined as follows:
case class Data(startDate: Long, endDate: Long, className: String, id: String, state: String)
What would the reducer and inverse reducer functions look like if I would
like to add the data for the current
1) How is the minPartitions parameter in the NaiveBayes example used? What is
the default value?
2) Why is numFeatures specified as a parameter? Can it not be
obtained from the data? This parameter is not specified for the other MLlib
algorithms.
Thanks
Why isn't a simple window function sufficient?
eventData.window(Minutes(15), Seconds(3)) will keep generating RDDs every 3
seconds, each containing the last 15 minutes of data.
TD
On Wed, Aug 6, 2014 at 3:43 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
I have a DStream called eventData and it
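A rough sketch of that suggestion, assuming eventData is the DStream[Data] from the question:

import org.apache.spark.streaming.{Minutes, Seconds}

val windowed = eventData.window(Minutes(15), Seconds(3))
windowed.foreachRDD { rdd =>
  // Each RDD covers the last 15 minutes of events, recomputed every 3 seconds.
  println(s"events in window: ${rdd.count()}")
}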
Hi,
I am trying to look at, for instance, the following SQL query in Spark 1.1:
SELECT table.key, table.value, table2.value FROM table2 JOIN table WHERE
table2.key = table.key
When I look at the output, I see that there are several stages, and several
tasks per stage. The tasks have a TID; I do not
Hello,
I am trying to use the Python IDE PyCharm for Spark application
development. How can I use PySpark with a Python IDE? Can anyone help me with
this?
Thanks
Sathish
My naive setup:
Adding
os.environ['SPARK_HOME'] = '/path/to/spark'
sys.path.append('/path/to/spark/python')
on top of my script.
from pyspark import SparkContext
from pyspark import SparkConf
Execution works from within PyCharm...
Though my next step is to figure out autocompletion, and I bet there
I posted this on the cdh-user mailing list yesterday and think this should have
been the right audience for it:
=
Hi All,
Not sure if anyone else has faced this same issue or not.
We installed CDH 4.6, which uses Hive 0.10.
And we have Spark 0.9.1, which comes with Hive 0.11.
Now our Hive jobs that
I haven't tried any of this, mind you, but my guess is that your options,
from least painful and most likely to work onwards, are:
- Get Spark / Shark to compile against Hive 0.10
- Shade Hive 0.11 into Spark
- Update to CDH 5.0+
I don't think there will be more updated releases of Shark or
Hey Aniket,
Great thoughts! I understand the use case. But as you have realized yourself,
it is not trivial to cleanly stream an RDD as a DStream. Since RDD
operations are defined to be scan-based, it is not efficient to define an RDD
based on slices of data within a partition of another RDD, using
This is maybe not exactly what you are asking for, but you might consider
looking at the queryExecution (a developer API that shows how the query is
analyzed / executed)
sql(...).queryExecution
On Wed, Aug 6, 2014 at 3:55 PM, Tom thubregt...@gmail.com wrote:
Hi,
I am trying to look at for
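Concretely, a rough sketch of inspecting the plans for the query above; it assumes a SQLContext named sqlContext with both tables already registered:

val q = sqlContext.sql(
  "SELECT table.key, table.value, table2.value FROM table2 JOIN table WHERE table2.key = table.key")

// Developer API: prints the parsed, analyzed, optimized, and physical plans.
println(q.queryExecution)
println(q.queryExecution.executedPlan)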
(Forgot to include the mailing list in my reply. Here it is.)
Hi,
On Thu, Aug 7, 2014 at 7:55 AM, Tom thubregt...@gmail.com wrote:
When I look at the output, I see that there are several stages, and several
tasks per stage. The tasks have a TID, I do not see such a thing for a
stage.
They
I narrowed down the error. Unfortunately this is not a quick fix. I have
opened a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-2892
On Wed, Aug 6, 2014 at 3:59 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Okay let me give it a shot.
On Wed, Aug 6, 2014 at 3:57 PM,
Hi,
That is interesting. Would you please share some code on how you are setting
the regularization type, regularization parameters and running Logistic
Regression?
Thanks,
Burak
- Original Message -
From: SK skrishna...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Wednesday,
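For reference, a rough sketch of how the regularization knobs are usually set on the SGD optimizer; the training RDD[LabeledPoint], the iteration count, and the regParam value are placeholders, not the poster's actual settings:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.L1Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)   // L1 regularization; SquaredL2Updater gives L2
val model = lr.run(training)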
Hi,
I get a lot of "executor lost" errors for saveAsTextFile with PySpark
and Hadoop 2.4.
For small datasets this error occurs, but since the dataset is small the
output eventually gets written to the file.
For large datasets, it takes forever to write the final output.
Any help is appreciated.
Avishek
Mohit, this doesn't seem to be working; can you please provide more
details? When I use from pyspark import SparkContext, it is disabled in
PyCharm. I use PyCharm Community Edition. Where should I set the
environment variables: in the same Python script or a different one?
Also, should I run
Hi There,
I'm starting to use Spark and have a rookie problem. I'm using standalone mode with
the master only, and here is what I did:
./sbin/start-master.sh
./bin/pyspark
When I tried the wordcount.py example, where my input file is a bit big,
I got an out-of-memory error, which I excerpted
One possible straightforward explanation might be that your solution(s) are
stuck in local minima, and depending on your weight initialization you
are getting different parameters.
Maybe use the same initial weights for both runs...
or
I would probably test the execution with synthetic
Hi,
The reason I am looking to do it differently is that the latency and
batch processing times are bad, about 40 seconds. I took the times from the
Streaming UI.
As you suggested, I tried the window as below and still the times are bad.
val dStream = KafkaUtils.createStream(ssc, zkQuorum, group,
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are
Strings holding the contents of those (UTF-8) files, but NOT split into lines.
Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB?
Similarly, can such a thing be registered as a table so that I can
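One way to build such a pair RDD is sc.wholeTextFiles, sketched below with a placeholder bucket and prefix. Each value is a single in-memory Java String, so in practice it is bounded by Integer.MAX_VALUE characters and by the memory available on one executor:

// Each element is (full file path, entire file contents as one String).
val files = sc.wholeTextFiles("s3n://bucket-name/some-prefix/")
files.take(3).foreach { case (path, content) =>
  println(s"$path -> ${content.length} characters")
}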
I have tested it in spark-1.1.0-SNAPSHOT.
It is OK now.
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: August 6, 2014, 23:12
To: Lizhengbing (bing, BIPA)
Cc: user@spark.apache.org
Subject: Re: fail to run LBFGS on 5G KDD data in Spark 1.0.1?
Do you mind testing 1.1-SNAPSHOT and allocating more memory to
Hello,
I'm trying to modify a Spark sample app to integrate with Cassandra; however,
I see an exception when submitting the app. Does anyone know why it happens?
Exception in thread "main" java.lang.NoClassDefFoundError:
com/datastax/spark/connector/rdd/reader/RowReaderFactory
at