Hi Andrew,
Thanks for your suggestion. I updated hdfs-site.xml on the server side
and also on the client side to use the hostname instead of the IP, as mentioned here:
http://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/
Now I can see that the client is
Any inputs on this will be helpful.
You can use JavaPairRDD.saveAsHadoopFile/saveAsNewAPIHadoopFile.
Best Regards,
Shixiong Zhu
2014-06-20 14:22 GMT+08:00 abhiguruvayya sharath.abhis...@gmail.com:
Any inputs on this will be helpful.
My program runs in standalone mode; the command line is like:
/opt/spark-1.0.0/bin/spark-submit \
--verbose \
--class $class_name --master spark://master:7077 \
--driver-memory 15G \
--driver-cores 2 \
--deploy-mode cluster \
Does JavaPairRDD.saveAsHadoopFile store data as a SequenceFile? Then what is
the significance of RDD.saveAsSequenceFile?
Hey There,
I'd like to start voting on this release shortly because there are a
few important fixes that have queued up. We're just waiting to fix an
Akka issue. I'd guess we'll cut a vote in the next few days.
- Patrick
On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim m...@palantir.com wrote:
Hi
I get it. Thank you.
On Fri, Jun 20, 2014 at 4:43 PM, Sourav Chandra
sourav.chan...@livestream.com wrote:
From the StreamingContext object you can get a reference to the SparkContext,
which you can use to create broadcast variables.
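A minimal sketch of that, assuming Spark 1.0's Scala streaming API (the app name, host/port and the lookup map are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("broadcast-from-streaming").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// The StreamingContext exposes its underlying SparkContext; that is what
// you use to create the broadcast variable.
val lookup = ssc.sparkContext.broadcast(Map("spark" -> 1, "storm" -> 2))

val lines = ssc.socketTextStream("localhost", 9999)
// Reference the broadcast value inside the DStream transformations.
val tagged = lines.map(word => (word, lookup.value.getOrElse(word, 0)))
tagged.print()

ssc.start()
ssc.awaitTermination()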
On Fri, Jun 20, 2014 at 2:09 PM, Hahn Jiang
spark-submit has an argument --num-executors to set the number of
executors, but how could I set it from anywhere else?
We're using Shark and want to change the number of executors. The number of
executors seems to be the same as the number of workers by default?
Shall we configure the executor number manually? (Is
On 20 June 2014 at 01:46, Shivani Rao raoshiv...@gmail.com wrote:
Hello Andrew,
I wish I could share the code, but for proprietary reasons I can't. But I
can give some idea of what I am trying to do. The job reads a file
and processes each line of that file. I am
--num-executors seems to be available with YARN only.
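Not from the thread, but for standalone/Mesos deployments the usual knobs are spark.cores.max and spark.executor.memory on the SparkConf; a rough sketch (the values are only examples):

import org.apache.spark.{SparkConf, SparkContext}

// There is no --num-executors in standalone mode; instead cap the total
// cores the application may take and the memory given to each executor.
val conf = new SparkConf()
  .setAppName("resource-capped-app")
  .set("spark.cores.max", "16")        // total cores across the cluster
  .set("spark.executor.memory", "4g")  // memory per executor
val sc = new SparkContext(conf)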
Yes, learning on a dedicated Spark cluster and predicting inside a Storm
bolt is quite OK :)
Thanks all for your answers.
I'll post back if/when we try out this solution.
E/
2014-06-19 20:45 GMT+02:00 Shuo Xiang shuoxiang...@gmail.com:
If I'm understanding correctly, you want to use
Looking for something like scikit's grid search module.
C
Hi,
I am on Spark 0.9.0
I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
cores in the cluster).
I have an input rdd with 64 partitions.
I am running sc.mapPartitions(...).reduce(...)
I can see that I get full parallelism on the mapper (all my 32 cores are
busy
Hi all,
I'm running a job that seems to continually fail with the following
exception:
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at
This is a planned feature for v1.1. I'm going to work on it after the v1.0.1
release. -Xiangrui
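Until then, a manual parameter sweep is easy to write by hand; a rough sketch (trainModel and evaluate are hypothetical stand-ins, not MLlib API):

// Hypothetical stand-ins: trainModel would fit a model for one parameter
// setting and evaluate would score it on held-out data.
def trainModel(regParam: Double, numIterations: Int): (Double, Int) = (regParam, numIterations)
def evaluate(model: (Double, Int)): Double = -model._1  // dummy score

// Enumerate the grid and keep the best-scoring combination.
val grid = for {
  regParam      <- Seq(0.01, 0.1, 1.0)
  numIterations <- Seq(50, 100)
} yield (regParam, numIterations)

val best = grid.maxBy { case (r, n) => evaluate(trainModel(r, n)) }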
On Jun 20, 2014, at 6:46 AM, Charles Earl charles.ce...@gmail.com wrote:
Looking for something like scikit's grid search module.
C
Hi there,
We're trying out Spark and are experiencing some performance issues using
Spark SQL.
Can anyone tell us if our results are normal?
We are using the Amazon EC2 scripts to create a cluster with 3
workers/executors (m1.large).
Tried both Spark 1.0.0 as well as the git master; the
Hello Abhi, I did try that and it did not work.
And Eugen, yes, I am assembling the argonaut libraries in the fat jar. So
how did you overcome this problem?
Shivani
On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
On 20 June 2014 at 01:46, Shivani Rao
Thanks! I will try that.
I guess what I am most confused about is why the executors are trying to
retrieve the jars directly using the info I provided to add jars to my
Spark context. I mean, that's bound to fail, no? I could be on a different
machine (so my file:// isn't going to work for them), or
Hi All,
I have an 8 million row, 500 column data set, which is derived by reading a text
file and doing filter and flatMap operations to weed out some anomalies.
Now I have a process which has to run through all 500 columns, do a couple of
map, reduce, and forEach operations on the data set and return some
Your data source is S3 and the data is used twice. m1.large does not have very good
network performance. Please try file.count() and see how fast it goes. -Xiangrui
On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk wrote:
Hi there,
We're trying out Spark and are
Also, you could consider caching your data after the first split (before
the first filter); this will prevent you from retrieving the data from S3
twice.
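A minimal sketch of that suggestion, assuming the shell's sc (the path and column logic are made up):

// Read once from S3 and cache right after the split, before any filtering.
val raw  = sc.textFile("s3n://my-bucket/input.csv")   // illustrative path
val rows = raw.map(_.split(',')).cache()

// Both of these now reuse the cached rows instead of re-reading from S3.
val total    = rows.count()
val filtered = rows.filter(_.length == 500).count()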
On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote:
Your data source is S3 and data is used twice. m1.large does not
Hi,
Since I migrated to Spark 1.0.0, a couple of applications that used to work in
0.9.1 now fail when broadcasting a variable.
Those applications are run on a YARN cluster in yarn-cluster mode (and used to
run in yarn-standalone mode in 0.9.1)
Here is an extract of the error log:
Exception
Hello Michael,
I have a quick question for you. Can you clarify the statement "build fat
JARs and build dist-style TAR.GZ packages with launch scripts, JARs and
everything needed to run a Job"? Can you give an example?
I am using sbt assembly as well to create a fat jar, and supplying the
I noticed that when I submit a job to YARN it mistakenly tries to upload
files to the local filesystem instead of HDFS. What could cause this?
In spark-env.sh I have HADOOP_CONF_DIR set correctly (and spark-submit does
find YARN), and my core-site.xml has a fs.defaultFS that is hdfs, not local.
Hi Koert,
Could you provide more details? Job arguments, log messages, errors, etc.
On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote:
I noticed that when I submit a job to YARN it mistakenly tries to upload
files to the local filesystem instead of HDFS. What could cause this?
Yes, it can if you set the output format to SequenceFileOutputFormat. The
difference is that saveAsSequenceFile does the conversion to Writable for you if
needed and then calls saveAsHadoopFile.
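A small sketch of both forms in the Scala API, assuming the shell's sc (output paths are made up); the Java-API equivalent is JavaPairRDD.saveAsHadoopFile with SequenceFileOutputFormat:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// saveAsSequenceFile converts keys/values to Writables for you.
pairs.saveAsSequenceFile("hdfs:///tmp/seq-implicit")

// The same thing spelled out via saveAsHadoopFile: convert to Writables
// yourself and name SequenceFileOutputFormat explicitly.
pairs
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }
  .saveAsHadoopFile("hdfs:///tmp/seq-explicit",
    classOf[Text], classOf[IntWritable],
    classOf[SequenceFileOutputFormat[Text, IntWritable]])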
On Fri, Jun 20, 2014 at 12:43 AM, abhiguruvayya sharath.abhis...@gmail.com
wrote:
Does
In my case it was due to a case class I was defining in the spark-shell that
was not available on the workers. So packaging it in a jar and adding it
with ADD_JARS solved the problem. Note that I don't exactly remember if it
was an out-of-heap-space exception or PermGen space. Make sure your
On Fri, Jun 20, 2014 at 8:22 AM, Koert Kuipers ko...@tresata.com wrote:
Thanks! I will try that.
I guess what I am most confused about is why the executors are trying to
retrieve the jars directly using the info I provided to add jars to my Spark
context. I mean, that's bound to fail, no? I
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the
below issues without running a patched version of Spark:
https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version
conflicts for client applications
https://issues.apache.org/jira/browse/SPARK-2043 --
Yeah, sure, see below. I strongly suspect it's something I misconfigured
causing YARN to try to use the local filesystem mistakenly.
[koert@cdh5-yarn ~]$ /usr/local/lib/spark/bin/spark-submit --class
org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3
Thanks for your suggestions.
file.count() takes 7s, so that doesn't seem to be the problem.
Moreover, a union with the same code/CSV takes about 15s (SELECT * FROM
rooms2 UNION SELECT * FROM rooms3).
The web status page shows that both stages 'count at joins.scala:216' and
'reduce at
Koert, is there any chance that your fs.defaultFS isn't set up right?
On Fri, Jun 20, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote:
Yeah, sure, see below. I strongly suspect it's something I misconfigured
causing YARN to try to use the local filesystem mistakenly.
I'm trying to work around the StackOverflowError when an object has a long
dependency chain; someone said I should use checkpoint to cut off
dependencies. I wrote some sample code to test it, but I can only checkpoint
edges, not vertices. I think I do materialize vertices and edges after
calling
I've tried to parallelize the separate regressions using
allResponses.toParArray.map( x => do logistic regression against labels in x )
But I start to see messages like
14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task
363.0:4)
14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss
Hi Shivani,
I use sbt assembly to create a fat jar.
https://github.com/sbt/sbt-assembly
Example of the sbt file is below.
import AssemblyKeys._ // put this at the top of the file
assemblySettings
mainClass in assembly := Some("FifaSparkStreaming")
name := "FifaSparkStreaming"
version := "1.0"
How about a treeReduceByKey? :-)
On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote:
Currently, the reduce operation combines the results from the mappers
sequentially, so it's O(n).
Xiangrui is working on treeReduce which is O(log(n)). Based on the
benchmark, it dramatically
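In the meantime the idea can be approximated by hand; a rough sketch (not the MLlib implementation) that reduces within partitions first and then shuffles the partials onto a handful of keys, so the driver only combines a few values:

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Illustrative only: a two-level reduce; a real treeReduce can use more levels.
def treeishReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T, groups: Int = 8): T = {
  // Reduce within each partition first, producing one partial per partition.
  val partials = rdd.mapPartitions { it =>
    if (it.hasNext) Iterator(it.reduce(f)) else Iterator.empty
  }
  partials
    .map(v => (Random.nextInt(groups), v)) // spread partials over a few keys
    .reduceByKey(f)                        // combine on the executors
    .values
    .reduce(f)                             // final combine of at most `groups` values
}

// e.g. treeishReduce(sc.parallelize(1 to 1000000, 64), (a: Int, b: Int) => a + b)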
Hi All,
I was curious to know which of the two approaches is better for doing
analytics using Spark Streaming. Let's say we want to add some metadata to
the stream which is being processed, like sentiment, tags, etc., and then
perform some analytics using this added metadata.
1) Is it OK to make a
Hi,
this is just a follow-up regarding this issue. Turns out that it's caused
by a bug in Spark. I created a case for it:
https://issues.apache.org/jira/browse/SPARK-2204 and submitted a patch.
Any chance this could be included in the 1.0.1 release?
Thanks,
- Sebastien
On Tue, Jun 17, 2014
OK, solved it. As it happened, in spark/conf I also had a file called
core.site.xml (with some Tachyon-related stuff in it), so that's why it
ignored /etc/hadoop/conf/core-site.xml.
On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers ko...@tresata.com wrote:
i put some logging statements in
Dear Spark users,
I have a small 4 node Hadoop cluster. Each node is a VM -- 4 virtual cores, 8GB
memory and 500GB disk. I am currently running Hadoop on it. I would like to run
Spark (in standalone mode) alongside Hadoop on the same nodes. Given the
configuration of my nodes, will that work?
The ideal way to do that is to use a cluster manager like YARN or Mesos. You
can control how many resources to give to which node, etc.
You should be able to run both together in standalone mode; however, you may
experience varying latency/performance in the cluster as both MR and Spark
demand resources
Are you looking to create Shark operators for RDF? Since the Shark backend is
shifting to Spark SQL it would be slightly hard, but a much better effort would
be to shift Gremlin to Spark (though a much beefier one :) )
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Maybe some SPARQL features in Shark, then?
aℕdy ℙetrella
about.me/noootsab
http://about.me/noootsab
On Fri, Jun 20, 2014 at 9:45 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
Are you looking to create Shark operators for RDF? Since the Shark backend is
Or a separate RDD for SPARQL operations, à la SchemaRDD... operators for
SPARQL can be defined there. Not a bad idea :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 20, 2014 at 3:56 PM, andy petrella
For development/testing I think it's fine to run them side by side as you
suggested, using Spark standalone. Just be realistic about what size data
you can load with limited RAM.
On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
The ideal way to do that is to use a
If the metadata is directly related to each individual record, then it can
be done either way. Since I am not sure how easy or hard it will be for
you to add tags before putting the data into Spark Streaming, it's hard to
recommend one method over the other.
However, if the metadata is related to
Hi, just wondering if anybody knows how to set the number of workers (and
the amount of memory) in Mesos while launching spark-shell? I was trying to
edit conf/spark-env.sh, and it looks like the environment variables are
for YARN or standalone. Thanks!
You should be able to configure it via the SparkContext in the Spark shell:
spark.cores.max and memory.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 20, 2014 at 4:30 PM, Shuo Xiang shuoxiang...@gmail.com wrote:
In short, ADD_JARS will add the jar to your driver classpath and also send
it to the workers (similar to what you are doing when you call sc.addJar).
ex: MASTER=master/url ADD_JARS=/path/to/myJob.jar ./bin/spark-shell
You also have the SPARK_CLASSPATH variable, but it does not distribute the code; it
is
It looks like I was running into
https://issues.apache.org/jira/browse/SPARK-2204
The issue went away when I changed to spark.mesos.coarse.
Kyle
On Fri, Jun 20, 2014 at 10:36 AM, Kyle Ellrott kellr...@soe.ucsc.edu
wrote:
I've tried to parallelize the separate regressions using
Hi Meethu,
Are you using Spark 1.0? If so, you should use spark-submit (
http://spark.apache.org/docs/latest/submitting-applications.html), which
has --executor-memory. If you don't want to specify this every time you
submit an application, you can also specify spark.executor.memory in
Folks,
I want to analyse logs and I want to use Spark for that. However,
Elasticsearch has a fancy frontend in Kibana. Kibana's docs indicate that
it works with Elasticsearch only. Is there a similar frontend that can work
with Spark?
Mohit.
P.S.: On MapR's Spark FAQ I read a statement like
Hi,
We would like to add ourselves to the user list, if possible please?
Company: truedash
url: truedash.io
Automatic pulling of all your data into Spark for enterprise
visualisation, predictive analytics and data exploration at a low cost.
Currently in development with a few clients.
Thanks
Hello Shrikar,
Thanks for your email. I have been using the same workflow as you did. But
my question was related to the creation of the SparkContext. My question was:
if I am specifying jars in the java -cp jar-paths, and adding them
to my build.sbt, do I need to additionally add them in my code
That error typically means that there is a communication error (wrong
ports) between the master and worker. Also check if the worker has write
permissions to create the work directory. We were getting this error due to
one of the above two reasons.
On Tue, Jun 17, 2014 at 10:04 AM, Luis Ángel Vicente
Hi Shivani,
Adding JARs to the classpath (e.g. via the -cp option) is needed to run your
_local_ Java application, whatever it is. To deliver them to _other
machines_ for execution you need to add them to the SparkContext. And you can
do it in 2 different ways:
1. Add them right from your code (your
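For reference, a rough sketch of adding jars from code, assuming Spark 1.0's Scala API (the paths are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Declare the jars up front on the SparkConf...
val conf = new SparkConf()
  .setAppName("my-job")
  .setJars(Seq("/path/to/myJob.jar"))      // illustrative path
val sc = new SparkContext(conf)

// ...or add one after the context has been created.
sc.addJar("/path/to/another-dependency.jar")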
Hi,
I need to parse a file which is separated by a series of separators. I used
SparkContext.textFile and I ran into two problems:
1) One of the separators is '\004', which can be recognized by Python or R
or Hive; however, Spark seems unable to recognize this one and returns a symbol
looking like '?'.
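Not from the thread, but one way to split on that byte explicitly, assuming the shell's sc and an illustrative path:

// '\u0004' is the byte Hive writes as '\004' (Ctrl-D); splitting on the
// character itself avoids depending on how it happens to be displayed.
val fields = sc.textFile("hdfs:///data/records.txt")
  .map(line => line.split('\u0004'))

fields.take(5).foreach(arr => println(arr.mkString(" | ")))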
I only ran HDFS on the same nodes as Spark and that worked out great
performance- and robustness-wise. However, I did not run Hadoop itself to
do any computations/jobs on the same nodes. My expectation is that if
you actually ran both at the same time with your configuration, the
performance