Pat, columnSimilarities is what that blog post is about, and is already
part of Spark 1.2.
rowSimilarities in a RowMatrix is a little more tricky because you can't
transpose a RowMatrix easily, and is being tracked by this JIRA:
https://issues.apache.org/jira/browse/SPARK-4823
Andrew, sometimes
1) The fields in the SELECT clause are not pushed down to the predicate
pushdown API. I have many optimizations that allow fields to be filtered
out before the resulting object is serialized on the Accumulo tablet
server. How can I get the selection information from the execution plan?
I'm a
Thanks Akhil and Sean for the responses.
I will try shutting down spark, then storage and then the instances.
Initially, when hdfs was in safe mode, I waited for 1 hour and the problem
still persisted. I will try this new method.
Thanks!
On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen
I see now. It optimizes the selection semantics so that fewer things need to
be included just to do a count(). Very nice. I did a collect() instead of a
count just to see what would happen, and it looks like all the expected
select fields were propagated down. Thanks.
On Sat,
Originally posted here:
http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script
I'm trying to launch a standalone Spark cluster using its pre-packaged EC2
scripts, but it just indefinitely hangs in an 'ssh-ready' state:
Can't you send a special event through spark streaming once the list is
updated? So you have your normal events and a special reload event
On Jan 17, 2015, at 15:06, Ji ZHANG zhangj...@gmail.com wrote:
Hi,
I want to join a DStream with some other dataset, e.g. join a click
stream with a spam
I suspect that putting a function into a shared variable incurs additional
overhead? Any suggestion on how to avoid that?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-tp21194p21210.html
Sent from the Apache Spark User List mailing list
Yeah okay, thanks.
On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote:
Pat, columnSimilarities is what that blog post is about, and is already part
of Spark 1.2.
rowSimilarities in a RowMatrix is a little more tricky because you can't
transpose a RowMatrix easily, and
Hi,
When I run this:
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
as per here
https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211,
maven doesn't build Spark's dependencies.
Only when I run:
Hi,
This is because 'ssh-ready' in the ec2 script means that all the instances
are in the 'running' state and all of them have an 'OK' status.
In other words, the instances are ready to download and install
software, just as EMR is ready for bootstrap actions.
Before, the script just
Hi,
My Spark jobs suddenly started hanging, and here is the debug output leading
up to it:
Stepping through the program, it seems to be stuck whenever I do any collect(),
count(), or rdd.saveAsParquetFile(). AFAIK, any operation that requires data
to flow back to the master causes this. I increased the memory to 5 MB.
Hello Users,
I've got a real-world use case that seems common enough that its pattern would
be documented somewhere, but I can't find any references to a simple solution.
The challenge is that data is getting dumped into a directory structure, and
that directory structure itself contains
BTW it looks like row and column similarities (cosine based) are coming to
MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the
master yet. Does anyone know the status?
See:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
Michael,
What I'm seeing (in Spark 1.2.0) is that the required columns being pushed
down to the DataRelation are not the product of the SELECT clause but
rather just the columns explicitly included in the WHERE clause.
Examples from my testing:
SELECT * FROM myTable -- The required columns are
In the Mahout Spark R-like DSL, [A’A] and [AA’] don’t actually do a
transpose; it’s optimized out. Mahout has had a standalone row matrix transpose
since day 1 and supports it in the Spark version. You can’t really do matrix
algebra without it, even though it’s often possible to optimize it away.
How are you running your test here? Are you perhaps doing a .count()?
On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet cjno...@gmail.com wrote:
Michael,
What I'm seeing (in Spark 1.2.0) is that the required columns being pushed
down to the DataRelation are not the product of the SELECT clause
Akhil,
Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.
- Patrick
On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
My mails to the mailing list are getting rejected, have opened a
I'm deploying Spark using the "Click to Deploy Hadoop - Install Apache
Spark" option on Google Compute Engine.
I can run Spark jobs on the REPL and read data from Google storage.
However, I'm not sure how to access the Spark UI in this deployment. Can
anyone help?
Also, it deploys Spark 1.1. Is there an
There are 3 jars under the lib_managed/jars directory both with and
without the -Dscala-2.11 flag.
The difference between the scala-2.10 and scala-2.11 profiles is that the
scala-2.10 profile has the following:
<modules>
  <module>external/kafka</module>
</modules>
FYI
On Sat, Jan 17, 2015 at 4:07 PM, Ted Yu
Yep. They have sorted it out it seems.
On 18 Jan 2015 03:58, Patrick Wendell pwend...@gmail.com wrote:
Akhil,
Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.
- Patrick
On Sat, Jan 17, 2015 at 12:56 AM,
I did the following:
1655 dev/change-version-to-2.11.sh
1657 mvn -DHADOOP_PROFILE=hadoop-2.4 -Pyarn,hive -Phadoop-2.4
-Dscala-2.11 -DskipTests clean package
And mvn command passed.
Did you see any cross-compilation errors ?
Cheers
BTW the two links you mentioned are consistent in terms of
Hi,
Driver programs submitted by the spark-submit script will get the runtime spark
master URL, but how it get the URL inside the main method when creating the
SparkConf object?
Regards,
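To illustrate the usual pattern (a sketch, assuming the standard spark-submit behavior: the script sets spark.master before main() runs, so the driver never hard-codes it):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    // Do NOT call setMaster() here; spark-submit injects spark.master
    // into the configuration before main() is invoked.
    val conf = new SparkConf().setAppName("master-url-example")
    val sc = new SparkContext(conf)
    // The runtime master URL can be read back from the conf if needed.
    println(sc.getConf.get("spark.master"))
    sc.stop()
  }
}
```

This only works when launched via spark-submit; running the class directly leaves spark.master unset.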
Makes sense.
On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote:
We're focused on providing block matrices, which makes transposition simple:
https://issues.apache.org/jira/browse/SPARK-3434
On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote:
In the
We're focused on providing block matrices, which makes transposition
simple: https://issues.apache.org/jira/browse/SPARK-3434
On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote:
In the Mahout Spark R-like DSL [A’A] and [AA’] doesn’t actually do a
transpose—it’s optimized
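As a sketch of what that looks like with the block-matrix work tracked in SPARK-3434 (hedged: this assumes the BlockMatrix API as it later shipped in MLlib's linalg.distributed package; `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Build a small distributed matrix from (row, col, value) entries,
// convert it to a BlockMatrix, and transpose it. Transposition is
// cheap because it just swaps block indices and transposes each block.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0)))
val blockMat: BlockMatrix = new CoordinateMatrix(entries).toBlockMatrix()
val transposed: BlockMatrix = blockMat.transpose
```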
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest
asking in the GCE support forum. You could also try to launch a Spark cluster
by hand on nodes in there. Sigmoid Analytics published a package for this here:
http://spark-packages.org/package/9
Matei
On Jan 17,
I'm new to Spark. From my experience, when I use a single StreamingContext to
create different input streams from different sources I get multiple errors
and problems downstream. This seems like it is not the way to go. From what
I read, creating multiple StreamingContexts is not advised. It appears
Hi, I am trying to run a simple count on an S3 bucket, but with Spark 1.2.0
on EC2 it is not able to run.
I started my cluster using the ec2 script that came with Spark 1.2.0.
Some part of the code:
It works with Spark 1.1.1, but not with 1.2.0.
-
Software Developer
I'm getting this also, with Scala 2.11 and Scala 2.10:
15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/18 07:34:51 INFO Remoting: Starting remoting
15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread
I'm new to Spark and have run into issues using Kryo for serialization
instead of Java. I have my SparkConf configured as such:
val conf = new SparkConf().setMaster("local").setAppName("test")
  .set("spark.kryo.registrationRequired", "false")
  .set("spark.serializer",
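For reference, a typical complete Kryo setup looks roughly like this (a sketch: MyKeyClass and MyValueClass are placeholders for your custom serializable classes; turning registrationRequired on is what usually surfaces any missing registrations as explicit errors):

```scala
import org.apache.spark.SparkConf

// Hypothetical custom classes standing in for the real key/value types.
class MyKeyClass extends Serializable
class MyValueClass extends Serializable

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Register every custom class Kryo will serialize; with
  // registrationRequired=true, unregistered classes fail fast.
  .registerKryoClasses(Array(classOf[MyKeyClass], classOf[MyValueClass]))
```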
Hi all!
I found this problem when I tried running python application on Amazon's EMR
yarn cluster.
It is possible to run bundled example applications on EMR but I cannot
figure out how to run a little bit more complex python application which
depends on some other python scripts. I tried adding
I’ve read these pages. In the paper “GraphX: Graph Processing in a Distributed
Dataflow Framework”, the authors claim that it only takes 400 seconds for the
uk-2007-05 dataset, which is of similar size to my dataset. Is the current
GraphX the same version as the GraphX in that paper? And how many
My mails to the mailing list are getting rejected, have opened a Jira
issue, can someone take a look at it?
https://issues.apache.org/jira/browse/INFRA-9032
Thanks
Best Regards
What is the data size? Have you tried increasing the driver memory??
Thanks
Best Regards
On Sat, Jan 17, 2015 at 1:01 PM, Kevin (Sangwoo) Kim kevin...@apache.org
wrote:
Hi experts,
I got an error during unpersist RDD.
Any ideas?
java.util.concurrent.TimeoutException: Futures timed out
Safest way would be to first shutdown HDFS and then shutdown Spark (call
stop-all.sh would do) and then shutdown the machines.
You can execute the following command to leave safe mode:
*hdfs dfsadmin -safemode leave*
Thanks
Best Regards
On Sat, Jan 17, 2015 at 8:31 AM, Su She
Try:
JavaPairDStream<String, String> foo = ssc.<String, String,
    SequenceFileInputFormat>fileStream("/sigmoid/foo");
Thanks
Best Regards
On Sat, Jan 17, 2015 at 4:24 AM, Leonidas Fegaras fega...@cse.uta.edu
wrote:
Dear Spark users,
I have a problem using File Streams in Java on Spark 1.2.0. I can
Hi,
I am using Spark-1.0.0 on a single-node cluster. When I run a job with a
small data set it runs perfectly, but when I use a data set of 350 KB, no
output is produced, and when I try to run it a second time it gives me an
exception saying that the SparkContext was shut down.
Can anyone
Try setting the following property:
.set("spark.akka.frameSize", "50")
Also make sure that Spark is able to read from HBase (you can try it with a
small amount of data).
Thanks
Best Regards
On Fri, Jan 16, 2015 at 11:30 PM, Antony Mayi antonym...@yahoo.com.invalid
wrote:
Hi,
I believe this is some
Wow, glad to know that it works well, and sorry, the Jira is another issue,
which is not the same case here.
From: Bagmeet Behera [mailto:bagme...@gmail.com]
Sent: Saturday, January 17, 2015 12:47 AM
To: Cheng, Hao
Subject: Re: using hiveContext to select a nested Map-data-type from an
Can you paste the code? Also you can try updating your spark version.
Thanks
Best Regards
On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I am using Spark-1.0.0 in a single node cluster. When I run a job with
small data set it runs perfectly but when I use
You would not want to turn off storage underneath Spark. Shut down
Spark first, then storage, then shut down the instances. Reverse the
order when restarting.
HDFS will be in safe mode for a short time after being started before
it becomes writeable. I would first check that it's not just that.
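The order above can be sketched as follows (paths assume a standard Spark and Hadoop 2.x layout; adjust SPARK_HOME and HADOOP_HOME for your install):

```shell
# Shut down Spark first, then HDFS, then the instances.
"$SPARK_HOME"/sbin/stop-all.sh
"$HADOOP_HOME"/sbin/stop-dfs.sh
# ...then stop the EC2 instances.

# On restart, reverse the order: instances, then HDFS, then Spark.
# HDFS stays in safe mode briefly after starting; check its state with:
"$HADOOP_HOME"/bin/hdfs dfsadmin -safemode get
```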
Hello Folks:
I'm running into the following error while executing relatively
straightforward spark-streaming code. Am I missing anything?
*Exception in thread "main" java.lang.AssertionError: assertion failed: No
output streams registered, so nothing to execute*
Code:
val conf = new
You need to trigger some action (stream.print(), stream.foreachRDD,
stream.saveAs*) over the stream that you created for the entire pipeline to
execute.
In your code add the following line:
*unifiedStream.print()*
Thanks
Best Regards
On Sat, Jan 17, 2015 at 3:35 PM, Rohit Pujari
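A minimal sketch of the fix (hedged: the socket source is just a placeholder; the point is that at least one output operation must be registered on a stream before ssc.start(), or the assertion above fires):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
val ssc = new StreamingContext(conf, Seconds(1))

// Placeholder input source; any DStream works the same way.
val lines = ssc.socketTextStream("localhost", 9999)

// Without an output operation (print, foreachRDD, saveAs*), start()
// fails with "No output streams registered, so nothing to execute".
lines.print()

ssc.start()
ssc.awaitTermination()
```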
Hi Francois:
I tried using print(kafkaStream) as output operator but no luck. It throws
the same error. Any other thoughts?
Thanks,
Rohit
From: francois.garil...@typesafe.commailto:francois.garil...@typesafe.com
francois.garil...@typesafe.commailto:francois.garil...@typesafe.com
Date:
Not print(kafkaStream), which would just print some String description
of the stream to the console, but kafkaStream.print(), which actually
invokes the print operation on the stream.
On Sat, Jan 17, 2015 at 10:17 AM, Rohit Pujari rpuj...@hortonworks.com wrote:
Hi Francois:
I tried using
I'm not sure how you are setting these values though. Where is
spark.yarn.executor.memoryOverhead=6144 ? Env variables aren't the
best way to set configuration either. Again have a look at
http://spark.apache.org/docs/latest/running-on-yarn.html
... --executor-memory 22g --conf
That was it. Thanks Akhil and Owen for your quick response.
On Sat, Jan 17, 2015 at 4:27 AM, Sean Owen so...@cloudera.com wrote:
Not print(kafkaStream), which would just print some String description
of the stream to the console, but kafkaStream.print(), which actually
invokes the print
Hm, this test hangs for me in IntelliJ. It could be a real problem,
and a combination of a) just recently actually enabling Java tests, b)
recent updates to the complicated Guava shading situation.
The manifestation of the error usually suggests that something totally
failed to start (because of,
Yes. I built Spark 1.2 with Apache Hadoop 2.2. No compatibility issues.
On Sat, Jan 17, 2015 at 4:47 AM, bhavyateja [via Apache Spark User List]
ml-node+s1001560n21197...@n3.nabble.com wrote:
Is spark 1.2 is compatibly with HDP 2.1
--
If you reply to this email,
Antony:
Please check hbase master log to see if there was something noticeable in that
period of time.
If the hbase cluster is not big, check region server log as well.
Cheers
On Jan 16, 2015, at 10:00 AM, Antony Mayi antonym...@yahoo.com.INVALID
wrote:
Hi,
I believe this is some
Hi, guys!
I'm reviving this old question from Nick Chammas with a new proposal: what
do you think about creating a separate Stack Exchange 'Apache Spark' site
(like 'philosophy' and 'English' etc.)?
I'm not sure what would be the best way to deal with user and dev lists,
though - to merge them
When I run the following Spark SQL example within IDEA, I get a
StackOverflowError; it looks like the scala.util.parsing.combinator.Parsers
are recursing infinitely.
Has anyone encountered this?
package spark.examples
import org.apache.spark.{SparkContext, SparkConf}
import
Hi,
I'm running Spark 1.2 in yarn-client mode. (using Hadoop 2.6.0)
On VirtualBox, I can run spark-shell --master yarn-client without any
error
However, on a physical machine, I got the following error.
Does anyone know why this happens?
Any help would be appreciated.
Thanks,
Kyounghyun
Hi,
I'm running Spark 1.2 in yarn-client mode. (using Hadoop 2.6.0)
On VirtualBox, I can run spark-shell --master yarn-client without any
error
However, on a physical machine, I got the following error.
Does anyone know why this happens?
Any help would be appreciated.
Thanks,
Kyounghyun
the values are for sure applied as expected - confirmed using the spark UI
environment page...
it comes from my defaults configured using
'spark.yarn.executor.memoryOverhead=8192' (yes, now increased even more) in
/etc/spark/conf/spark-defaults.conf and 'export SPARK_EXECUTOR_MEMORY=24G' in
The data size is about 300~400 GB. I'm using an 800 GB cluster and set the
driver memory to 50 GB.
On Sat Jan 17 2015 at 6:01:46 PM Akhil Das ak...@sigmoidanalytics.com
wrote:
What is the data size? Have you tried increasing the driver memory??
Thanks
Best Regards
On Sat, Jan 17, 2015 at 1:01 PM, Kevin
Hi,
The submitting applications guide in
http://spark.apache.org/docs/latest/submitting-applications.html says:
Alternatively, if your application is submitted from a machine far from the
worker machines (e.g. locally on your laptop), it is common to use cluster mode
to minimize network
People can continue using the stack exchange sites as is with no additional
work from the Spark team. I would not support migrating our mailing lists
yet again to another system like Discourse because I fear fragmentation of
the community between the many sites.
On Sat, Jan 17, 2015 at 6:24 AM,
Hi,
I want to join a DStream with some other dataset, e.g. join a click
stream with a spam ip list. I can think of two possible solutions, one
is use broadcast variable, and the other is use transform operation as
is described in the manual.
But the problem is the spam ip list will be updated
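One common sketch of the transform-based approach (hedged: `clickStream`, the `Click` type, and the `loadSpamList()` helper are all placeholders for illustration; the simplest variant just re-reads the list on every batch):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical event type and stream, keyed by source IP.
case class Click(url: String)
// clickStream: DStream[(String, Click)]

// loadSpamList(): hypothetical helper that re-reads the current spam IP
// list as an RDD[String], e.g. via sc.textFile on the latest file.
val filtered = clickStream.transform { rdd: RDD[(String, Click)] =>
  val spam = loadSpamList().map(ip => (ip, ())) // refreshed each batch
  // Left outer join against the spam list and keep only non-spam clicks.
  rdd.leftOuterJoin(spam).collect {
    case (ip, (click, None)) => (ip, click)
  }
}
```

The trade-off versus a broadcast variable is that transform re-evaluates its closure per batch, so the list is always current at the cost of re-reading it.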
Hi
Did you try using Spark 1.2 on HDP 2.1 YARN?
Can you please go through the thread
http://apache-spark-user-list.1001560.n3.nabble.com/Troubleshooting-Spark-tt21189.html
and check where I am going wrong? My word count program is erroring out
when using Spark 1.2 on YARN, but it's getting
Just wondering if you've made any progress on this -- I'm having the same
issue. My attempts to help myself are documented here
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E
.
I don't believe I have the
The Stack Exchange community will not support creating a whole new site
just for Spark (otherwise you’d see dedicated sites for much larger topics
like “Python”). Their tagging system works well enough to separate
questions about different topics, and the apache-spark
My key/value classes are custom serializable classes. It looks like a
bug. So I filed it on JIRA as SPARK-5297
Thanks
Leonidas
On 01/17/2015 03:07 AM, Akhil Das wrote:
Try:
JavaPairDStream<String, String> foo = ssc.<String, String,
    SequenceFileInputFormat>fileStream("/sigmoid/foo");
Hi all,
Thanks for your contribution. We have checked and confirmed that HDP 2.1
YARN doesn't work with Spark 1.2
On Sat, Jan 17, 2015 at 9:11 AM, bhavya teja potineni
bhavyateja.potin...@gmail.com wrote:
Hi
Did you try using spark 1.2 on hdp 2.1 YARN
Can you please go thru the thread
It worked for me. spark 1.2.0 with hadoop 2.2.0
On Sat, Jan 17, 2015 at 9:39 PM, bhavyateja [via Apache Spark User List]
ml-node+s1001560n21207...@n3.nabble.com wrote:
Hi all,
Thanks for your contribution. We have checked and confirmed that HDP 2.1
YARN doesn't work with Spark 1.2
On Sat,
Failing for me and another team member on the command line, for what it's worth.
On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote:
Hm, this test hangs for me in IntelliJ. It could be a real problem,
and a combination of a) just recently actually enabling Java tests, b)
recent
I did an initial implementation. There are two assumptions i had from the
start that I was very surprised were not a part of the predicate pushdown
API:
1) The fields in the SELECT clause are not pushed down to the predicate
pushdown API. I have many optimizations that allow fields to be filtered
Thanks Reza, interesting approach. I think what I actually want is to
calculate pair-wise distance, on second thought. Is there a pattern for that?
On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:
You can use K-means with a suitably large k. Each cluster should correspond
Andrew, you would be better off using Mahout's RowSimilarityJob for what you
are trying to accomplish.
1. It does give you pair-wise distances. 2. You can specify the distance
measure you are looking to use. 3. There's the old MapReduce impl and the
Spark DSL impl, per your preference.
From: Andrew
Yes it works with 2.2 but we are trying to use spark 1.2 on HDP 2.1
On Sat, Jan 17, 2015, 11:18 AM Chitturi Padma [via Apache Spark User List]
ml-node+s1001560n21208...@n3.nabble.com wrote:
It worked for me. spark 1.2.0 with hadoop 2.2.0
On Sat, Jan 17, 2015 at 9:39 PM, bhavyateja [via
The test passed here:
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1215/consoleFull
It passed locally with the following command:
mvn -DHADOOP_PROFILE=hadoop-2.4 -Phadoop-2.4 -Pyarn -Phive test
-Dtest=JavaAPISuite
FYI
Mahout’s Spark implementation of rowsimilarity is in the Scala
SimilarityAnalysis class. It actually does either row or column similarity but
only supports LLR at present. It does [AA’] for columns or [A’A] for rows first
then calculates the distance (LLR) for non-zero elements. This is a major
Excellent, thanks Pat.
On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:
Mahout’s Spark implementation of rowsimilarity is in the Scala
SimilarityAnalysis class. It actually does either row or column similarity
but only supports LLR at present. It does [AA’] for