Hi all
I am unable to access s3n:// URLs using sc.textFile(); I am getting a 'no file
system for scheme s3n://' error.
Is this a bug, or are some conf settings missing?
See below for details:
env variables :
AWS_SECRET_ACCESS_KEY=set
AWS_ACCESS_KEY_ID=set
spark/RELEASE:
Spark 1.3.1 (git revision 908a0bf)
Below is the code I use to access s3n without problems (only for 1.3.1; there is a bug
in 1.3.0).
Configuration hadoopConf = ctx.hadoopConfiguration();
hadoopConf.set("fs.s3n.impl",
    "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
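For reference, here is a minimal Scala sketch of the same idea (the credential property names are the standard Hadoop s3n keys, and the bucket path is made up):

// Hedged sketch: register the native s3n filesystem, pass the AWS credentials from
// the environment, then read a file. Bucket and key below are purely illustrative.
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val lines = sc.textFile("s3n://some-bucket/some-file.txt")
println(lines.count())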
This thread from hadoop mailing list should give you some clue:
http://search-hadoop.com/m/LgpTk2df7822
On Wed, Apr 22, 2015 at 9:45 AM, Sujee Maniyam su...@sujee.net wrote:
Hi all
I am unable to access s3n:// URLs using sc.textFile(); I am getting a 'no
file system for scheme s3n://' error.
I have two RDDs, each saved in a Parquet file. I need to join these two RDDs by
the id column. Can I create an index on the id column so they can join faster?
Here is the code
case class Example(val id: String, val category: String)
case class DocVector(val id: String, val vector: Vector)
val
I tried the solution from the guide, but I exceeded the size limit of a case class Row:
2015-04-22 15:22 GMT+02:00 Tathagata Das tathagata.das1...@gmail.com:
Did you checkout the latest streaming programming guide?
Does anybody have any thoughts on this?
On 21 April 2015 at 20:57, Jeetendra Gangele gangele...@gmail.com wrote:
The problem with k-means is that we have to define the number of clusters, which I
don't want in this case.
So I am thinking of something like hierarchical clustering. Any ideas or
suggestions?
Did you checkout the latest streaming programming guide?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
You also need to be aware that to convert JSON RDDs to a DataFrame,
sqlContext has to make a pass over the data to learn the schema. This will
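A minimal sketch of that point, assuming a Spark 1.3-era sqlContext and a hypothetical RDD of JSON strings called jsonLines; supplying an explicit schema avoids the extra inference pass:

// With the schema given up front, sqlContext does not need to scan the data to infer it.
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("value", DoubleType)))
val df = sqlContext.jsonRDD(jsonLines, schema)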
One way is to use export SPARK_PREPEND_CLASSES=true. This will instruct the
launcher to prepend the target directories for each project to the spark
assembly. I’ve had mixed experiences with it lately, but in principle
that's the only way I know.
On Wed, Apr 22, 2015 at 3:42 PM, zia_kayani
This thread seems related:
http://search-hadoop.com/m/JW1q51W02V
Cheers
On Wed, Apr 22, 2015 at 6:09 AM, James King jakwebin...@gmail.com wrote:
What's the best way to start up a Spark job as part of starting up the
Spark cluster?
I have a single uber jar for my job and want to make the
You can run multiple Spark clusters against one ZK cluster. Just use this
config to set independent ZK roots for each cluster:
spark.deploy.zookeeper.dir
The directory in ZooKeeper to store recovery state (default: /spark).
-Jeff
From: Sean Owen so...@cloudera.com
To: Akhil Das
In master branch, overhead is now 10%.
That would be 500 MB
FYI
On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
+1 to executor-memory to 5g.
Do check the overhead space for both the driver and the executor as per
Wilfred's suggestion.
Typically, 384 MB should
Is there a good resource that covers what kind of chatter (communication)
goes on between the driver, master, and worker processes?
Thanks
Hi all,
I am using Spark 1.3.1 to write a spectral clustering algorithm. This really
confused me today. At first I thought my implementation was wrong. It turns
out it's an issue in MLlib. Fortunately, I've figured it out.
I suggest adding a hint in the MLlib user documentation (as far as I know,
Xudong and Rex,
Can you try 1.3.1? With PR 5339 http://github.com/apache/spark/pull/5339,
after we get a Hive Parquet table from the metastore and convert it to our native
Parquet code path, we will cache the converted relation. For now, the first
access to that Hive Parquet table reads all of the footers
+1 to executor-memory to 5g.
Do check the overhead space for both the driver and the executor as per
Wilfred's suggestion.
Typically, 384 MB should suffice.
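A hedged illustration of the arithmetic in this thread, using the Spark-on-YARN overhead setting; the 10% and 384 MB figures are the ones quoted above, and the concrete values are examples rather than a recommendation:

// With a 5g executor and ~10% overhead, the YARN container request is roughly 5120 MB + 512 MB.
// spark.yarn.executor.memoryOverhead (in MB) overrides the computed default.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "5g")
  .set("spark.yarn.executor.memoryOverhead", "512")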
nice, thanks for the information.
Thanks
Best Regards
On Wed, Apr 22, 2015 at 8:53 PM, Jeff Nadler jnad...@srcginc.com wrote:
You can run multiple Spark clusters against one ZK cluster. Just use
this config to set independent ZK roots for each cluster:
spark.deploy.zookeeper.dir
Does it still hit the memory limit for the container?
An expensive transformation?
On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu yuzhih...@gmail.com wrote:
In master branch, overhead is now 10%.
That would be 500 MB
FYI
On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
Hi Everyone,
I have two options for filtering the RDD resulting from the Graph.vertices
method, as illustrated with the following pseudo code:
1. Filter
val vertexSet = Set(vertexOne, vertexTwo...)
val filteredVertices = Graph.vertices.filter(x =>
  vertexSet.contains(x._2.vertexName))
2. Join (a hedged sketch of this option follows below)
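The option-2 code was cut off above, so here is only a rough, hedged guess at what a join-based version might look like, reusing the names from the pseudo code (none of this is the original poster's code):

// Key the wanted names as a small RDD and join against the vertices keyed by name.
val wanted = sc.parallelize(vertexSet.toSeq).map(name => (name, ()))
val byName = Graph.vertices.map { case (id, attr) => (attr.vertexName, (id, attr)) }
val joinedVertices = byName.join(wanted).map { case (_, (vertex, _)) => vertex }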
Afternoon all,
I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via:
`mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package`
The error is encountered when running spark shell via:
`spark-shell --packages com.databricks:spark-csv_2.11:1.0.3`
The full trace of the
Is mylist present on every executor? If not, then you have to pass it
on, and broadcasts are the best way to pass it on. But note that once
broadcast it will be immutable at the executors, and if you update the list
at the driver, you will have to broadcast it again.
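A minimal Scala sketch of that pattern, assuming a driver-side list called mylist and some RDD named rdd (both illustrative):

// Broadcast once and read .value on the executors; re-broadcast after the driver-side copy changes.
var broadcastList = sc.broadcast(mylist)
rdd.filter(x => broadcastList.value.contains(x)).count()
// The old broadcast stays frozen on the executors, so ship an updated copy explicitly.
broadcastList = sc.broadcast(updatedList)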
TD
On Wed, Apr 22, 2015
Sure, thanks. If you can guide me on how to do this, it will be a great help.
On 17 April 2015 at 22:05, Ted Yu yuzhih...@gmail.com wrote:
I have some assignments on hand at the moment.
Will try to come up with sample code after I clear the assignments.
FYI
On Thu, Apr 16, 2015 at 2:00 PM,
Hi,
I'm running a unit test that keeps failing with the code I wrote in
Spark.
Here are the output logs from the test I ran. It gets the customers from
incoming events in the JSON called QueueEvent, and I am trying to convert the
incoming events for each customer to an alert.
From
Furthermore, just to explain, doing arr.par.foreach does not help because
it is not really running anything; it is only doing the setup of the computation.
Doing the setup in parallel does not mean that the jobs will be done
concurrently.
Also, from your code it seems like your pairs of DStreams don't
Can I use broadcast vars in local mode?
On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com wrote:
Yep. Not efficient. Pretty bad actually. That's why broadcast variables
were introduced right at the very beginning of Spark.
On Wed, Apr 22, 2015 at 10:58 AM, Vadim
Absolutely. The same code would work in local as well as distributed mode!
On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Can I use broadcast vars in local mode?
On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com
wrote:
Yep. Not
Yep. Not efficient. Pretty bad actually. That's why broadcast variables were
introduced right at the very beginning of Spark.
On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Thanks TD. I was looking into broadcast variables.
Right now I am running it
Will you be able to paste the code?
On 23 April 2015 at 00:19, Adrian Mocanu amoc...@verticalscope.com wrote:
Hi
I use the Elasticsearch package for Spark, and very often it times out
reading data from ES into an RDD.
How can I keep the connection alive (why doesn't it stay alive? Is it a bug?)
Here's
Hi
The gist of it is this:
I have data indexed into ES. Each index stores monthly data, and the query will
get data for some date range (across several ES indexes, or within one if the date
range is within one month). Then I merge these RDDs into an uberRdd, perform
some operations, and then print the
Basically, a read timeout means that no data arrived within the specified
receive timeout period.
A few things I would suggest:
1. Is your ES cluster up and running?
2. If 1 is yes, then reduce the size of the index, make it a few KB, and then
test.
On 23 April 2015 at 00:19, Adrian Mocanu
It could very well be that your executor memory is not enough to store the
state RDDs AND operate on the data. 1G per executor is quite low.
Definitely give more memory. And have you tried increasing the number of
partitions (specify the number of partitions in updateStateByKey)?
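A hedged sketch of that overload; the state type, update function, and the partition count 64 are illustrative and not taken from the thread:

// updateStateByKey accepts an explicit partition count for the state RDDs.
val updateFn = (values: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + values.sum)
val stateStream = pairStream.updateStateByKey(updateFn, 64)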
On Wed, Apr 22,
Hi
I use the Elasticsearch package for Spark, and very often it times out reading
data from ES into an RDD.
How can I keep the connection alive (why doesn't it stay alive? Is it a bug?)
Here's the exception I get:
org.elasticsearch.hadoop.serialization.EsHadoopSerializationException:
I ended up post-processing the result in Hive with a dynamic-partition
insert query to get a table partitioned by the key.
Looking further, it seems that 'dynamic partition' insert is in Spark SQL
and works well in Spark SQL version 1.2.0:
https://issues.apache.org/jira/browse/SPARK-3007
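A rough sketch of what such a dynamic-partition insert might look like when issued through a HiveContext; the table and column names are made up:

// Allow dynamic partitioning, then put the partition column last in the SELECT.
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
hiveContext.sql("INSERT OVERWRITE TABLE events_by_key PARTITION (key) SELECT value, key FROM raw_events")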
On
Thanks all...
BTW, s3n load works without any issues with spark-1.3.1-built-for-hadoop
2.4.
I tried this on 1.3.1-hadoop26:
sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val f = sc.textFile("s3n://bucket/file")
f.count
No, it can't find the
Hi!
I’m trying to query a dataset that reads data from csv and provides a SQL on
top of it. The problem I have is I have a hierarchy of objects that I need to
represent as a table so that users might use SQL to query it and do some
aggregations. I do have multi value attributes (that in csv
Yes. The fair scheduler only helps concurrency within an application. With
multiple shells you'd either need something like YARN/Mesos or careful math on
resources, as you said.
-------- Original message -------- From: Arun Patel
Hi
Thanks for the help. My ES is up.
Out of curiosity, do you know what the timeout value is? There are probably
other things happening to cause the timeout; I don't think my ES is that slow
but it's possible that ES is taking too long to find the data. What I see
happening is that it uses
I have now a fair understanding of the situation after looking at javap
output. So as a reminder:
dstream.map(tuple => {
  val w = StreamState.fetch[K, W](state.prefixKey, tuple._1)
  (tuple._1, (tuple._2, w)) })
And StreamState being a very simple standalone object:
object StreamState {
def
Can you give the full code, especially the myfunc?
On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Here's what I did:
print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)
print broadcastVar
print broadcastVar.value
print 'FINISHED BROADCASTING...'
Hi,
If you get an ES response back in 1-5 seconds, that's pretty slow. Are these
ES aggregation queries? Costin may be right about GC possibly causing
timeouts. SPM http://sematext.com/spm/ can give you all Spark and all
key Elasticsearch metrics, including various JVM metrics. If the problem
is
Hi, could you please explain how to checkpoint the training set RDD, since everything
is done in the ALS.train method?
I notice that Databricks provides an external data source API for Spark SQL, but I
can't find any reference documents explaining how to access HBase with it
directly.
Does anyone know? Or please give me some related links. Thanks.
ZhangYi (张逸)
BigEye
website:
Yin,
Thanks for your reply.
We already applied this PR to our 1.3.0.
As Xudong mentioned, we have thousands of Parquet files; the first read is very slow,
and another app will add more files and refresh the table
regularly.
Cheng Lian's PR-5334 seems to resolve this issue; it will skip reading all
You may need to specify the hive port itself. For example, my own Thrift
start command is in the form:
./sbin/start-thriftserver.sh --master spark://$myserver:7077
--driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host
$myserver --hiveconf hive.server2.thrift.port 1
HTH!
Hi,
First off, for Elasticsearch questions it is worth pinging the Elastic mailing
list, as that is more closely monitored than this one.
Back to your question, Jeetendra is right that the exception indicates no data
is flowing back to the es-connector and Spark.
The default is 1m [1], which should be
Sorry for the confusion. We should be more clear about the semantics in
the documentation. (PRs welcome :) )
.saveAsTable does not create a Hive table, but instead creates a Spark Data
Source table. Here the metadata is persisted into Hive, but Hive cannot
read the tables (as this API supports
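A hedged illustration of the distinction with a Spark 1.3-era HiveContext; the table names are made up, and the CTAS at the end is just one way to end up with a table Hive itself can read:

df.saveAsTable("events_datasource")  // Spark data source table: metastore entry, but not Hive-readable
df.registerTempTable("events_tmp")
hiveContext.sql("CREATE TABLE events_hive AS SELECT * FROM events_tmp")  // a plain Hive table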
I would like to easily launch a cluster that supports s3a file systems.
If I launch a cluster with `spark-ec2 --hadoop-major-version=2`,
what determines the minor version of Hadoop?
Does it depend on the Spark version being launched?
Are there other allowed values for --hadoop-major-version
Hi,
Just a quick question,
Regarding the source code of groupByKey:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L453
In the end, it casts CompactBuffer to Iterable. But why? Is there any advantage?
Thank you.
Hao.
Is there a way to set the cost value C when using linear SVM?
Hello,
I am using Spark 1.3 for the Spark SQL (Hive) ThriftServer with Beeline.
Beeline doesn't work with --hiveconf or --hivevar, which substitute
variables in Hive scripts.
I found the following JIRAs saying that Hive 0.13 resolved that issue.
I wonder if this is a well-known issue?
Vaguely makes sense. :) Wow that's an interesting corner case.
On Wed, Apr 22, 2015 at 1:57 PM, Jean-Pascal Billaud j...@tellapart.com
wrote:
I have now a fair understanding of the situation after looking at javap
output. So as a reminder:
dstream.map(tuple => {
  val w =
Is your ES cluster reachable from your Spark cluster via network / firewall?
Can you run the same query from the Spark master and slave nodes via curl or one
of the other clients?
Seems odd that GC issues would be a problem from the scan but not when running
the query from a browser plugin...
I am trying to read a few hundred .parquet files from S3 into an EMR cluster.
The .parquet files are structured by date and have _common_metadata in
each of the folders (as well as _metadata). The sqlContext.parquetFile
operation takes a very long time, opening for reading each of the
Hi:
there is a little error in the source code of LDA.scala at line 180, as
follows:
def setBeta(beta: Double): this.type = setBeta(beta)
which causes java.lang.StackOverflowError. It's easy to see there is an
error
This is the size of the serialized task closure. Is stage 246 part of
ALS iterations, or something before or after it? -Xiangrui
On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone
christian.per...@gmail.com wrote:
Hi Sean, thanks for the answer. I tried to call repartition() on the input
I submit a Spark app to YARN and I get these messages:
15/04/22 22:45:04 INFO yarn.Client: Application report for
application_1429087638744_101363 (state: RUNNING)
15/04/22 22:45:04 INFO yarn.Client: Application report for
application_1429087638744_101363 (state: RUNNING).
...
1) I can go to
The patch was merged and it will be included in 1.3.2 and 1.4.0.
Thanks for reporting the bug! -Xiangrui
On Tue, Apr 21, 2015 at 2:51 PM, ayan guha guha.a...@gmail.com wrote:
Thank you all.
On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote:
SchemaRDD subclasses RDD in 1.2, but
Please try reducing the step size. The native BLAS library is not
required. -Xiangrui
On Tue, Apr 21, 2015 at 5:15 AM, Staffan staffan.arvids...@gmail.com wrote:
Hi,
I've written an application that performs some machine learning on some
data. I've validated that the data _should_ give a good
We store the vectors on the driver node. So it is hard to handle a
really large vocabulary. You can use setMinCount to filter out
infrequent words to reduce the model size. -Xiangrui
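A minimal sketch of that suggestion; the threshold, vector size, and the tokenizedCorpus RDD of word sequences are all illustrative:

// Raising minCount drops rare words before the word vectors are materialized on the driver.
import org.apache.spark.mllib.feature.Word2Vec
val model = new Word2Vec().setMinCount(20).setVectorSize(100).fit(tokenizedCorpus)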
On Wed, Apr 22, 2015 at 12:32 AM, gm yu husty...@gmail.com wrote:
When using MLlib Word2Vec, I get the following
Having ordered indices is a contract of SparseVector:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector.
We do not verify it for performance. -Xiangrui
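A tiny illustration of that contract; the sizes and values are arbitrary:

// Indices handed to Vectors.sparse must already be in increasing order; Spark does not check this.
import org.apache.spark.mllib.linalg.Vectors
val ok  = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))  // sorted indices: fine
val bad = Vectors.sparse(5, Array(3, 1), Array(2.0, 0.5))  // unsorted: results are undefined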
On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan yaochun...@gmail.com wrote:
Hi all,
I am using
Thanks! That's a bug .. -Xiangrui
On Wed, Apr 22, 2015 at 9:34 PM, buring qyqb...@gmail.com wrote:
Hi:
there is a little error in the source code of LDA.scala at line 180, as
follows:
def setBeta(beta: Double): this.type = setBeta(beta)
which causes java.lang.StackOverflowError. It's easy
What is the feature dimension? Did you set the driver memory? -Xiangrui
On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com wrote:
I'm trying to use the StandardScaler in pyspark on a relatively small (a few
hundred Mb) dataset of sparse vectors with 800k features. The fit method of
Test it out, but I would be willing to bet the join is going to be a good
deal faster.
The YARN capacity scheduler supports hierarchical queues, to which you can assign
cluster resources as percentages. Your Spark application/shell can be
submitted to different queues. Mesos supports fine-grained mode, which
allows the machines/cores used by each executor to ramp up and down.
Lan
On Wed, Apr 22,
Here's what I did:
print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)
print broadcastVar
print broadcastVar.value
print 'FINISHED BROADCASTING...'
The above works fine,
but when I call myrdd.map(myfunc) I get NameError: global name
'broadcastVar' is not defined.
The myfunc function
https://github.com/databricks/spark-avro
On Tue, Apr 21, 2015 at 3:09 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Thanks Michael!
I have tried applying my schema programmatically but I didn't get any
improvement in performance :(
Could you point me to some code examples
You can try pulling the jar with wget and using it with --jars with spark-shell.
I used 1.0.3 with Spark 1.3.0 but with a different version of Scala. From the
stack trace it looks like the Spark shell is just not seeing the CSV jar...
DataFrames do not have the attributes 'alias' or 'as' in the Python API.
On 2015-04-21 20:41, Michael Armbrust wrote:
This is https://issues.apache.org/jira/browse/SPARK-6231
Unfortunately this is pretty hard to fix, as it's hard for us to
differentiate these without aliases. However, you can
Hi,
I am working on adding Geometry, i.e. a new DataType, into Spark Catalyst, so
that Row can hold that object as well. I've made progress, but it is time-consuming
as I have to compile the whole Spark project, otherwise the changes aren't
visible. I've tried to just build the Spark SQL and Catalyst modules
When using MLlib Word2Vec, I get the following error:
allocating large
array--thread_id[0x7ff2741ca000]--thread_name[Driver]--array_size[1146093680
bytes]--array_length[1146093656 elememts]
prio=10 tid=0x7ff2741ca000 nid=0x1405f runnable
at
I see. Now try a slightly tricky approach: add the Hive jar to the
SPARK_CLASSPATH (in the conf/spark-env.sh file on all machines) and make sure
that jar is available on all the machines in the cluster at the same path.
Thanks
Best Regards
On Wed, Apr 22, 2015 at 11:24 AM, Manku Timma
Hi Ankur,
Thanks for the answer. However, I still have the following queries.
On Wed, Apr 22, 2015 at 8:39 AM, Ankur Dave ankurd...@gmail.com wrote:
On Tue, Apr 21, 2015 at 10:39 AM, mas mas.ha...@gmail.com wrote:
How does GraphX store the routing table? Is it stored on the master node
or
You can enable this flag to run multiple jobs concurrently. It might not be
production ready, but you can give it a try (set it on the SparkConf before creating the context):
sparkConf.set("spark.streaming.concurrentJobs", "2")
Refer to TD's answer here
I wrote a few mails here regarding this issue.
After further investigation I think there is a bug in Spark 1.3 in saving
Hive tables.
(hc is HiveContext)
1. Verify the needed configuration exists:
scala> hc.sql("set hive.exec.compress.output").collect
res4: Array[org.apache.spark.sql.Row] =
Hi,
Looks interesting.
It is quite interesting to know what could have been the reason for
not showing these stats in the UI.
As per the description of Patrick W in
https://spark-project.atlassian.net/browse/SPARK-999, it does not mention
any exception w.r.t failed tasks/executors.
Can
The key thing would be to use different ZK paths for each cluster. You
shouldn't need more than 2 ZK quorums even for a large (few thousand node)
Hadoop cluster: one for the HA bits of the infrastructure (HDFS, YARN) and one
for the applications to abuse. It's easy for apps using ZK to stick
I believe we can use properties like --executor-memory and
--total-executor-cores to configure the resources allocated for each
application. But in a multi-user environment, shells and applications are
being submitted by multiple users at the same time. All users are
requesting resources with
Anyone?
On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra
sourav.chan...@livestream.com wrote:
Hi Olivier,
The update function is as below:
val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long,
Long)]) => {
  val previousCount = state.getOrElse((0L, 0L))._2
Not that I've tried it, but why couldn't you use one ZK server? I
don't see a reason.
On Wed, Apr 22, 2015 at 7:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
It isn't mentioned anywhere in the doc, but you will probably need separate
ZK for each of your HA clusters.
Thanks
Best Regards
Hi,
I want to write this in Spark SQL DSL:
select map('c1', c1, 'c2', c2) as m
from table
Is there a way?
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
Hi,
I am creating a Spark RDD from Accumulo like this:
JavaPairRDD<Key, Value> accumuloRDD =
    sc.newAPIHadoopRDD(accumuloJob.getConfiguration(), AccumuloInputFormat.class, Key.class,
        Value.class);
But I am getting the following error and it does not compile:
Bound mismatch: The
Anyone ?
On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu marius.dan...@gmail.com
wrote:
Hello anyone,
I have a question regarding the sort shuffle. Roughly I'm doing something
like:
rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
The problem is that in f2 I don't see the