Hi Maria,
Have you tried port 8080 as well?
Thanks
Himanshu
Hi Akhil,
(Your reply does not appear in the mailing list, but I received an email, so I
will reply here.)
I already have an application running in the shell using pyspark. I can see
the application running on port 8080, but I cannot log into it through port
4040. It says the connection timed out.
Hi Akhil,
Thanks for your reply! I still cannot see port 4040 on my machine when I
type master-ip-address:4040 in my browser.
I have tried this command: netstat -nat | grep 4040 and it returns this:
tcp        0      0 :::4040          :::*          LISTEN
Logging into
Hi,
I am using Spark 1.3.1 standalone and I have a problem: my cluster is
working fine, I can see port 8080 and check that my EC2 instances are
fine, but I cannot access port 4040.
I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark
context and restarting it, to no avail.
Note: CCing user@spark.apache.org
First, you must check whether the RDD is empty:
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) { ... }
}
Now, you can obtain the instance of a SQLContext:
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
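For reference, here is a minimal sketch of how that singleton is usually written and used, following the common pattern from the Spark Streaming examples (the SQLContextSingleton object below is that pattern, not code from this thread, and `messages` is assumed to be an existing DStream):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Lazily instantiated singleton SQLContext, reused across batches.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    // register a temp table or create a DataFrame from rdd here
  }
}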
4040 is your driver port; you need to run some application. Log in to your
cluster, start a spark-shell, and try accessing 4040.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote:
Hi,
I am using Spark 1.3.1 standalone and I have a problem where my cluster is
Additionally, if I delete the parquet and recreate it using the same generic
save function with 1000 partitions and overwrite, the size is again correct.
RDDs are immutable; why not join two DStreams?
Not sure, but you can try something like this also:
kvDstream.foreachRDD(rdd => {
  val file = ssc.sparkContext.textFile("/sigmoid/")
  val kvFile = file.map(x => (x.split(",")(0), x))
  rdd.join(kvFile)
})
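A variation on the same idea, assuming the lookup file is static across batches: load and key it once outside the loop, and use transform so the joined RDD actually continues through the DStream (the path and variable names are placeholders):

// Load the lookup data once and key it by the first CSV field.
val kvFile = ssc.sparkContext
  .textFile("/sigmoid/")
  .map(x => (x.split(",")(0), x))
  .cache()

// transform returns a new DStream, so the join result can be used downstream.
val joined = kvDstream.transform(rdd => rdd.join(kvFile))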
Thanks
Best Regards
On
Hi,
I have an idea to solve my problem: I want to write one file for each Spark
partition,
but I don't know how to get the actual partition suffix/ID in my call function.
points.foreachPartition(
  new VoidFunction<Iterator<Tuple2<Integer, GeoTimeDataTupel>>>() {
    private static
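Not from the code above, but a minimal Scala sketch of two ways to get at the current partition number (the points RDD and the output path are placeholders for whatever you are writing):

import org.apache.spark.TaskContext

// Option 1: mapPartitionsWithIndex passes the partition index in explicitly.
points.mapPartitionsWithIndex { (partitionId, iter) =>
  // write iter out to a file named after partitionId, e.g. s"part-$partitionId"
  iter
}.count() // force evaluation

// Option 2: ask the task context for the current partition id.
points.foreachPartition { iter =>
  val partitionId = TaskContext.get().partitionId()
  // write iter out to s"/some/output/dir/part-$partitionId"
}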
Hi,
Kudos on Spark 1.3.x, it's a great release - loving data frames!
One thing I noticed after upgrading is that if I use the generic save
DataFrame function with Overwrite mode and a parquet source, it produces
a much larger output parquet file.
Source json data: ~500GB
Originally saved parquet:
Hi,
I'm gathering that the typical approach for splitting an RDD is to apply
several filters to it.
rdd1 = rdd.filter(func1);
rdd2 = rdd.filter(func2);
...
Is there/should there be a way to create 'buckets' like these in one go?
List<RDD> rddList = rdd.filter(func1, func2, ..., funcN)
Another
Hello all.
I've been reading some old mails and noticed that the use of Kerberos in a
standalone cluster was not supported. Is this still the case?
Thanks.
Borja.
Opening your 4040 manually, or SSH tunneling (ssh -L 4040:127.0.0.1:4040
master-ip, and then opening localhost:4040 in your browser), will work for you
then.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 5:10 PM, mrm ma...@skimlinks.com wrote:
Hi Akhil,
Thanks for your reply! I still cannot see port
Both the driver (ApplicationMaster running on hadoop) and container
(CoarseGrainedExecutorBackend) end up exceeding my 25GB allocation.
My code is something like:
sc.binaryFiles(... 1mil xml files).flatMap( ... extract some domain classes,
not many though, as each xml usually has zero
Hi!
I'm struggling with an issue with Spark 1.3.1 running on YARN, running on
an AWS EMR cluster. Such cluster is based on AMI 3.7.0 (hence Amazon Linux
2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop
2.4, etc...). I make use of the AWS emr-bootstrap-action
Hi all
Recently I have been studying the Spark 1.3 core source code, but I can't
understand the RPC layer. How do the client driver, worker, and
master communicate?
There are some Scala files such as RpcCallContext, RpcEndpointRef,
RpcEndpoint and RpcEnv in the Spark core rpc module.
Are there any blogs about this?
Hi Jeroen,
Rather than bundling the Phoenix client JAR with your app, are you able to
include it in a static location, either in SPARK_CLASSPATH or via the
conf values below (I use SPARK_CLASSPATH myself, though it's deprecated):
spark.driver.extraClassPath
spark.executor.extraClassPath
Hi Cheng,
I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an
existing parquet file, then repartitioning and saving it. Doing this gives
the error. The code for this doesn't look like it's causing the problem; I have
a feeling the source, the existing parquet file, is the culprit.
I
The new RPC interface is an internal module added in 1.4. It should not
exist in 1.3. Where did you find it?
For the communication between driver, worker and master, Spark still uses
Akka. There is a pending PR to update them:
https://github.com/apache/spark/pull/5392 Do you mean the
Hm, I tried the following with 0.13.1 and 0.13.0 on my laptop (don't
have access to a cluster for now) but couldn't reproduce this issue.
Your program just executed smoothly... :-/
Command line used to start the Thrift server:
./sbin/start-thriftserver.sh --driver-memory 4g --master local
Hi George,
I have same issue, did you manage to find a solution?
best,
/Shahab
On Wed, May 13, 2015 at 9:21 PM, George Adams g.w.adams...@gmail.com
wrote:
Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my
classpath. I’ve outlined the issue on Stack Overflow (
Hi Josh,
Thank you for your effort. Looking at your code, I feel that mine is
semantically the same, except written in Java. The dependencies in the pom.xml
all have scope 'provided'. The job is submitted as follows:
$ rm spark.log MASTER=spark://maprdemo:7077
Maybe you should update your Spark version to the latest one.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar
shekhar.kote...@gmail.com wrote:
Hi,
I have configured Spark to run on YARN. Whenever I start spark shell using
'spark-shell' command, it
Also, if the data isn't confidential, would you mind sending me a
compressed copy (don't cc user@spark.apache.org)?
Cheng
On 6/10/15 4:23 PM, 姜超才 wrote:
Hi Lian,
Thanks for your quick response.
I forgot to mention that I have tuned driver memory from 2G to 4G; it seems
to have given a minor improvement. The
Hopefully Swig (http://www.swig.org/index.php) and JNA
(https://github.com/twall/jna/) might help for accessing C++ libraries from
Java.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 11:50 AM, mahesht mahesh.s.tup...@gmail.com wrote:
There is C++ component which uses some model which we want to replace
On 6/10/15 1:55 AM, James Pirz wrote:
I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on
Hadoop 2.6.
I looked at the ThriftServer2 logs and realized that the server was
not starting properly because of a failure in creating a server socket.
In fact, I had passed the URI to
Hi Akshat,
I assume what you want is to control the number of partitions in your RDD,
which is easily achievable by passing the numSlices or minSplits argument at
the time of RDD creation. Example:
val someRDD = sc.parallelize(someCollection, numSlices)
// or
val someRDD = sc.textFile(pathToFile,
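For completeness, a small self-contained sketch of the same calls with concrete partition counts (the values and paths are only illustrative):

// Control the partition count at creation time.
val someCollection = 1 to 1000000
val fromCollection = sc.parallelize(someCollection, 8) // 8 partitions
val fromFile = sc.textFile("/path/to/file", 8)         // at least 8 partitions

// An existing RDD can also be repartitioned after the fact.
val repartitioned = fromFile.repartition(16)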
Hm, this is a common confusion... Although the variable name is
`sqlContext` in Spark shell, it's actually a `HiveContext`, which
extends `SQLContext` and has the ability to communicate with the Hive metastore.
So your program needs to instantiate an
`org.apache.spark.sql.hive.HiveContext` instead.
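In Scala that is just the following (a minimal sketch, assuming an existing SparkContext `sc`):

import org.apache.spark.sql.hive.HiveContext

// HiveContext extends SQLContext and can talk to the Hive metastore.
val sqlContext = new HiveContext(sc)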
Thanks Ak, thanks for your idea. I had tried using Spark to do what the
shell did. However, it is not as fast as I expected and not very easy.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉
Would you mind providing the executor output so that we can check the
reason why the executors died?
And you may run EXPLAIN EXTENDED to find out the physical plan of your
query, something like:
0: jdbc:hive2://localhost:1> explain extended select * from foo;
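If you are not in beeline, the same information can be printed from a SQLContext/HiveContext (a sketch, assuming the table foo is already registered):

// DataFrame.explain(true) prints the extended logical and physical plans.
val df = sqlContext.sql("SELECT * FROM foo")
df.explain(true)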
Hi Xiaohan,
Would you please try setting spark.sql.thriftServer.incrementalCollect
to true and increasing the driver memory size? In this way,
HiveThriftServer2 uses RDD.toLocalIterator rather than
RDD.collect().iterator to return the result set. The key difference is
that RDD.toLocalIterator
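A rough illustration of that difference outside the Thrift server (the table and variable names are placeholders): collect() materializes the whole result on the driver at once, while toLocalIterator pulls it back one partition at a time.

val rows = sqlContext.sql("SELECT * FROM big_table").rdd

// Materializes every row on the driver at once -- can exhaust driver memory.
val all = rows.collect().iterator

// Streams the result partition by partition; peak driver memory is roughly
// the size of the largest partition rather than the whole result set.
val streamed = rows.toLocalIterator
streamed.take(10).foreach(println)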
Hi Sam,
You might want to have a look at the Spark UI, which runs by default at
localhost:8080. You can also configure Apache Ganglia to monitor your
cluster resources.
Thank you
Regards
Himanshu Mehra
Would you please also provide executor stdout and stderr output? Thanks.
Cheng
On 6/10/15 4:23 PM, 姜超才 wrote:
Hi Lian,
Thanks for your quick response.
I forgot to mention that I have tuned driver memory from 2G to 4G; it seems
to have given a minor improvement. It still dies the same way when fetching 1,400,000 rows
Michael had answered this question in the SO thread
http://stackoverflow.com/a/30226336
Cheng
On 6/10/15 9:24 PM, shahab wrote:
Hi George,
I have same issue, did you manage to find a solution?
best,
/Shahab
On Wed, May 13, 2015 at 9:21 PM, George Adams g.w.adams...@gmail.com
Hi,
if you now want to write 1 file per partition, that's actually built into
Spark as saveAsTextFile(path): write the elements of the dataset as a text file
(or set of text files) in a given directory in the local filesystem, HDFS
or any other Hadoop-supported file system. Spark will call
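In other words, each partition already comes out as its own part-NNNNN file; a minimal sketch (the output path is a placeholder):

// Each partition is written as its own part-00000, part-00001, ... file.
val data = sc.parallelize(1 to 100, 4) // 4 partitions
data.saveAsTextFile("/tmp/one-file-per-partition")
// Result: /tmp/one-file-per-partition/part-00000 .. part-00003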
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
Note that this property is only available for YARN
There's a discussion of this at https://github.com/apache/spark/pull/5403
On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote:
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
Do you build via maven or sbt? How do you submit your application -- do you
use local, standalone or mesos/yarn? Your jars as you originally listed
them seem right to me. Try this, from your ${SPARK_HOME}:
Hi Roni,
These are exposed as public APIs. If you want, you can run them inside of the
adam-shell (which is just a wrapper for the spark shell, but with the ADAM
libraries on the class path).
Also, I need to save all my intermediate data. It seems like ADAM stores data
in Parquet on HDFS.
I
Here is the physical plan.
Also attaching the executor log from one of the executors. You can see that
memory consumption rises slowly until it reaches around 10.5 GB. It stays
there for around 5 minutes, from 06:50:36 to 06:55:00. Then this
executor gets killed. ExecutorMemory
I am profiling the driver. It currently has 564MB of strings, which might be
the 1mil file names. But it also has 2.34 GB of long[]! That's so far; it
is still running. What are those long[] used for?
It seems that Spark SQL can't retrieve table size statistics and doesn't
enable broadcast join in your case. Would you please try `ANALYZE TABLE
table-name` on both tables to generate table statistics information?
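A sketch of what that looks like from a HiveContext (the table names are placeholders; the noscan variant only gathers the size statistics the broadcast-join decision needs):

// Generate table-size statistics so Spark SQL can choose a broadcast join.
sqlContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")
sqlContext.sql("ANALYZE TABLE large_table COMPUTE STATISTICS noscan")

// Tables below this threshold (in bytes) are broadcast; the default is about 10 MB.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)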
Cheng
On 6/10/15 10:26 PM, Sourav Mazumder wrote:
Here is the physical plan.
It's always better to use a quasi-Newton solver if the runtime and problem
scale permit, as there are guarantees on optimization... OWLQN and BFGS are
both quasi-Newton.
Most single-node code bases will run quasi-Newton solves... If you are
using SGD, it is better to use AdaDelta/AdaGrad or similar.
Thanks Akhil. For posterity, I ended up with:
https://gist.github.com/dokipen/aa07f351a970fe54fcff
I couldn't get rddToFilename() to work, but its impl was pretty simple.
I'm a poet but I don't know it.
On Tue, Jun 9, 2015 at 3:10 AM Akhil Das ak...@sigmoidanalytics.com wrote:
like
There is a C++ component which uses some model that we want to replace with the
Spark model output, but there is no C++ API support for reading the model. What
is the best way to solve this problem?
Thank you for responding @nsalian.
1. I am trying to replicate this
https://github.com/dibbhatt/kafka-spark-consumer project on my local
system.
2. Yes, Kafka and the brokers are on the same host.
3. I am working with Kafka 0.7.3 and Spark 1.3.1. Kafka 0.7.3 does not have the
--describe command. Though
Hi,
Can you please provide a more detailed stack trace from your receiver logs and
also the consumer settings you used? I have never tested the consumer with
Kafka 0.7.3, so I'm not sure if the Kafka version is the issue. Have you tried
building the consumer using Kafka 0.7.3?
Regards,
Dibyendu
On Wed, Jun 10,
Or you can do sc.addJar("/path/to/the/jar"); I haven't tested it with an HDFS
path, though it works fine with a local path.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 10:17 AM, Jörn Franke jornfra...@gmail.com wrote:
I am not sure they work with HDFS paths. You may want to look at the
source code.
It depends on how big the Batch RDD requiring reloading is.
Reloading it for EVERY single DStream RDD would slow down the stream processing
in line with the total time required to reload the Batch RDD.
But if the Batch RDD is not that big, then that might not be an issue,
especially in
So, I don't have an explicit solution to your problem, but...
On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
kostas.koug...@googlemail.com wrote:
I am profiling the driver. It currently has 564MB of strings, which might be
the 1mil file names. But it also has 2.34 GB of long[]! That's so
While it does feel like a filter is what you want to do, a common way to
handle this is to map to different keys.
Using your rddList example it becomes like this (scala style):
---
val rddSplit: RDD[(Int, Any)] = rdd.map(x => (createKey(x), x))
val rddBuckets: RDD[(Int, Iterable[Any])] =
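The message is cut off above; here is a self-contained sketch of the whole idea, with a hypothetical createKey that assigns each element a bucket number:

import org.apache.spark.rdd.RDD

// Hypothetical bucketing function: decide which bucket an element belongs to.
def createKey(x: Int): Int = x % 3

val rdd: RDD[Int] = sc.parallelize(1 to 100)

// One pass over the data: tag each element with its bucket key, then group.
val rddSplit: RDD[(Int, Int)] = rdd.map(x => (createKey(x), x))
val rddBuckets: RDD[(Int, Iterable[Int])] = rddSplit.groupByKey()

// Or, if each bucket is processed separately anyway, filter the keyed RDD:
val bucket0 = rddSplit.filter(_._1 == 0).values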
After some time the driver has accumulated 6.67 GB of long[]. The executor
memory usage so far is low.
Hi,
If checkpoint data is already present in HDFS, the driver fails to load because
it performs a lookup on the previous application's directory. As that folder
already exists, it fails to start the context.
The failed job's application id was application_1432284018452_0635 and the job
was performing a lookup on
Hive version 1.x is currently not supported.
Cheers
On Wed, Jun 10, 2015 at 9:16 AM, Neal Yin neal@workday.com wrote:
I am trying to build spark 1.3 branch with Hive 1.1.0.
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -Phive-0.13.1 -Dhive.version=1.1.0
I did not change the driver program. I just shut down the context and started
it again.
BTW, I see this ticket already open in unassigned state - SPARK-6892
https://issues.apache.org/jira/browse/SPARK-6892 - which talks about this
issue.
Is this a known issue?
Also, any workarounds?
On Wed, Jun 10,
Delete the checkpoint directory; you might have modified your driver
program.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam ashnigamt...@gmail.com
wrote:
Hi,
If checkpoint data is already present in HDFS, driver fails to load as it
is performing lookup on previous
I am trying to build spark 1.3 branch with Hive 1.1.0.
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
-Phive-0.13.1 -Dhive.version=1.1.0 -Dhive.version.short=1.1.0 -DskipTests clean
package
I got the following error:
Failed to execute goal on project spark-hive_2.10:
No, but you can write a couple of lines of code that do this. It's not
optimized, of course. This is actually a long and interesting side
discussion, but I'm not sure how much it could be, given that the
computation is pull rather than push; there is no concept of one
pass over the data resulting in
Hi nsalian,
For some reason the rest of this thread isn't showing up here. The
NodeManager isn't busy. I'll copy/paste, the details are in there.
I've tried running a Hadoop app pointing to the same queue. Same
Dear List,
I'm trying to reference a lonely message to this list from March 25th (
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html
), but I'm unsure this will thread properly. Sorry if it didn't work out.
Anyway, using Spark 1.4.0-RC4 I run into the same
On YARN, there is no concept of a Spark Worker. Multiple executors will be
run per node without any effort required by the user, as long as all the
executors fit within each node's resource limits.
-Sandy
On Wed, Jun 10, 2015 at 3:24 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Yes i think
All,
I was wondering if any of you have solved this problem:
I have pyspark (ipython mode) running on Docker, talking to
a YARN cluster (the AM/executors are NOT running on Docker).
When I start pyspark in the Docker container, it binds to port 49460.
Once the app is submitted to YARN, the app (AM)
Yes, I think it is ONE worker, ONE executor, as an executor is nothing but a
JVM instance spawned by the worker.
To run more executors, i.e. JVM instances, on the same physical cluster node,
you need to run more than one worker on that node and then allocate only part
of the system resources to that
Hi,
I am a Spark newbie trying to solve the same problem, and I have
implemented the exact same solution that sowen is suggesting. I am using
priority queues to keep track of the top 25 sub_categories per category,
and using the combineByKey function to do that.
However, I run into the
Hi,
Thanks for the added information; it helps add more context.
Is that specific queue different from the others?
FairScheduler.xml should have the information needed, or a separate
allocations.xml if you have one. Something of this format:
<allocations>
  <queue name="sample_queue">
    <minResources>1
Thanks for your help!
Switching to HiveContext fixed the issue.
Just one side comment:
In the documentation regarding Hive Tables and HiveContext
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables,
we see:
// sc is an existing JavaSparkContext.
HiveContext sqlContext =
I have tried both cases (s3 and s3n, setting all possible parameters), and
trust me, the same code works with 1.3.1, but not with 1.3.0, 1.4.0, or 1.5.0.
I even use a plain project to test this, and use Maven to include all
referenced libraries, but it gives me an error.
I think everyone can easily
Thanks for pointing out the documentation error :) Opened
https://github.com/apache/spark/pull/6749 to fix this.
On 6/11/15 1:18 AM, James Pirz wrote:
Thanks for your help !
Switching to HiveContext fixed the issue.
Just one side comment:
In the documentation regarding Hive Tables and
Actually this is somewhat confusing, for two reasons:
- First, the option 'spark.executor.instances', which seems to be dealt
with only in the case of YARN in the source code of SparkSubmit.scala, is also
present in the conf/spark-env.sh file under the standalone section, which
would indicate that
What is the best way in Spark to reuse Hive custom transform scripts written in
Python, awk or C++ which process data from stdin and print to stdout?
These scripts typically use the Transform syntax in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
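One possibility, if the full HiveContext TRANSFORM syntax isn't an option, is RDD.pipe, which streams each partition through an external process over stdin/stdout; a minimal sketch (the script and paths are placeholders):

// Pipe each partition's lines through an external stdin/stdout script.
val input = sc.textFile("/data/input")
val transformed = input.pipe("python /path/to/transform_script.py")
transformed.saveAsTextFile("/data/output")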
This PR adds support for multiple executors per worker:
https://github.com/apache/spark/pull/731 and should be available in 1.4.
Thanks,
Nishkam
On Wed, Jun 10, 2015 at 1:35 PM, Evo Eftimov evo.efti...@isecc.com wrote:
We/I were discussing STANDALONE mode; besides, maxdml had already
Launching using spark-ec2 script results in:
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
...
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond:[ OK ]
Shutting down GANGLIA gmond:
You need to register using spark-default.xml as explained here.
In many cases the shuffle will actually hit the OS buffer cache and
not ever touch spinning disk if it is a size that is less than memory
on the machine.
- Patrick
On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote:
So with this... to help my understanding of Spark under the
So with this... to help my understanding of Spark under the hood:
Is this statement correct: "When data needs to pass between multiple JVMs, a
shuffle will *always* hit disk"?
On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote:
There's a discussion of this at
Thanks much for the detailed explanations. I suspected it came down to
architectural support for the notion of an RDD of RDDs, but my understanding of
Spark, and of distributed computing in general, is not deep enough to let me
work that out myself, so this really helps!
I ended up going with List[RDD]. The collection of
Hello,
I am using 1.4.0 and found the following weird behavior.
This case works fine:
scala> sc.parallelize(Seq((1, 2), (3, 100))).toDF.withColumn("index", rand(30)).show()
+---+---+------------------+
| _1| _2|             index|
+---+---+------------------+
|  1|  2|0.6662967911724369|
|
You may compare the c:\windows\system32\drivers\etc\hosts files to see if they
are configured similarly.
On Wed, Jun 10, 2015 at 5:16 PM, Eran Medan eran.me...@gmail.com wrote:
I've hit a roadblock trying to understand why Spark doesn't work for a
colleague of mine on his Windows 7 laptop.
I have pretty
I don't think it's propagated automatically. Try this:
spark-submit --conf spark.executorEnv.PYTHONPATH=... ...
On Wed, Jun 10, 2015 at 8:15 AM, Bob Corsaro rcors...@gmail.com wrote:
I'm setting PYTHONPATH before calling pyspark, but the worker nodes aren't
inheriting it. I've tried looking
Looks like the libphp version is 5.6 now; which version of Spark are you using?
Thanks
Best Regards
On Thu, Jun 11, 2015 at 3:46 AM, barmaley o...@solver.com wrote:
Launching using spark-ec2 script results in:
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
...
Shutting down GANGLIA
This might help
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html
Thanks
Best Regards
On Wed, Jun 10, 2015 at 6:49 PM, kazeborja kazebo...@gmail.com wrote:
Hello all.
I've been reading some old mails and
OK, so it is the case that small shuffles can be done without hitting any
disk. Is this also the case for the aux shuffle service in YARN? Can that
be done without hitting disk?
On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote:
In many cases the shuffle will actually hit
I've hit a roadblock trying to understand why Spark doesn't work for a
colleague of mine on his Windows 7 laptop.
I have pretty much the same setup and everything works fine.
I googled the error message and didn't get anything that resolved it.
Here is the exception message (after running spark
I'm setting PYTHONPATH before calling pyspark, but the worker nodes aren't
inheriting it. I've tried looking through the code and it appears that it
should, but I can't find the bug. Here's an example; what am I doing wrong?
https://gist.github.com/dokipen/84c4e4a89fddf702fdf1