Totally depends on your database. If it's a NoSQL database like
MongoDB/HBase etc., then you can use the native .saveAsNewAPIHadoopFile or
.saveAsHadoopDataset etc.
For SQL databases, I think people usually put the overhead on the driver
like you did.
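For example, with a connector that ships a Hadoop OutputFormat, a rough sketch could look like the following (the mongo-hadoop class names and the "mongo.output.uri" key are used purely as an illustration and should be checked against the connector version you actually use; the data is made up):

  import org.apache.hadoop.conf.Configuration
  import com.mongodb.hadoop.MongoOutputFormat
  import org.bson.BasicBSONObject

  // Pair RDD of (id, document) -- illustrative data only.
  val docs = sc.parallelize(Seq(("1", Map("name" -> "a")), ("2", Map("name" -> "b"))))
    .mapValues { fields =>
      val doc = new BasicBSONObject()
      fields.foreach { case (k, v) => doc.put(k, v) }
      doc
    }

  val conf = new Configuration()
  conf.set("mongo.output.uri", "mongodb://host:27017/db.collection") // assumed URI

  docs.saveAsNewAPIHadoopFile(
    "file:///tmp/unused",            // the path is ignored by DB-backed output formats
    classOf[Object],
    classOf[BasicBSONObject],
    classOf[MongoOutputFormat[Object, BasicBSONObject]],
    conf)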
Thanks
Best Regards
On Wed, Mar 18, 2015 at
Thanks Khanderao.
On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail
khanderao.k...@gmail.com wrote:
I have used various versions of Spark (1.0, 1.2.1) without any issues.
Though I have not used Kafka significantly with 1.3.0, preliminary
testing revealed no issues.
- khanderao
Hi,
I have configured Apache Spark 1.3.0 with Hive 1.0.0 and Hadoop 2.6.0.
I am able to create tables and retrieve data from Hive tables via the following
commands, but I am not able to insert data into a table.
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS newtable (key INT)");
scala> sqlContext.sql("select *
Thanks Cheng for replying.
Meant to say to change number of partitions of a cached table. It doesn’t need
to be re-adjusted after caching.
To provide more context:
What I am seeing on my dataset is that we have a large number of tasks. Since
it appears each task is mapped to a partition, I
Hello Everyone,
I am trying to run this MLlib example from Learning Spark:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
Things I'm doing differently:
1) Using spark shell instead of an application
2) instead of
Dear Spark experts, I'd appreciate it if you could look into my problem and give me
some help and suggestions here... Thank you!
I have a simple Spark application to parse and analyze logs, and I can
run it on my Hadoop YARN cluster. The problem is that I find it
runs quite slowly on the cluster,
Can you see where exactly it is spending time? Like you said, it gets to
Stage 2, so you will be able to see how much time it spent on Stage 1.
See if it's GC time; if so, try increasing the level of parallelism or
repartitioning, e.g. to sc.defaultParallelism * 3.
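A rough sketch of that suggestion (assuming sc is your SparkContext; the input path and variable names are only illustrative):

  // Spread the data over roughly 3 partitions per default-parallelism slot,
  // which can reduce per-task GC pressure at the cost of more (smaller) tasks.
  val logs = sc.textFile("hdfs:///path/to/logs")   // hypothetical input
  val repartitioned = logs.repartition(sc.defaultParallelism * 3)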
Thanks
Best Regards
On Thu, Mar
Do people have a reliable/repeatable method for solving dependency issues
or tips?
The current world of spark-hadoop-hbase-parquet-... is very challenging
given the huge footprint of dependent packages and we may be pushing
against the limits of how many packages can be combined into one
Check this out: https://github.com/cloudant/spark-cloudant. It supports
both the DataFrame and SQL approaches for reading data from Cloudant and saving
it.
Looking forward to your feedback on the project.
Yang
I get 2 types of errors:
-org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0 and
FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407
- discarded
Spark keeps retrying to submit the code and keeps getting this error.
My file on
Hello,
I have persisted an RDD[T] to disk through saveAsObjectFile. Then I
changed the implementation of T. When I read the file with sc.objectFile
using the new binary, I got the exception of java.io.InvalidClassException,
which is expected.
I try to catch this error via SparkException in the
Hi Akhil,
Thank you for your help. I just found that the problem is related to my
local Spark application: I ran it in IntelliJ and didn't reload the
project after recompiling the jar via Maven. If I don't reload, it will
use some locally cached data to run the application, which leads to
Hi all,
I'm trying to run the sample Spark application in version v1.2.0 and above.
However, I've encountered a weird issue like the one below. This issue is only seen
in v1.2.0 and above; v1.1.0 and v1.1.1 are fine.
The sample code:
val sc : SparkContext = new SparkContext(conf)
val
Hi Akhil,
1) How could I see how much time it is spending on stage 1? Or what if,
like above, it doesn't get past stage 1?
2) How could I check if it's GC time? And where would I increase the
parallelism for the model? I have a Spark Master and 2 Workers running on
CDH 5.3...what would the
Hey guys,
Not sure if I'm the only one who got this. We are building a highly available
standalone Spark env. We are using ZK with 3 masters in the cluster.
However, sbin/start-slaves.sh calls start-slave.sh for each member in the
conf/slaves file, and specifies the master using $SPARK_MASTER_IP and
JIRA and PR for first issue:
https://issues.apache.org/jira/browse/SPARK-6408
https://github.com/apache/spark/pull/5087
On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues.
First, the syntax where
How did you generate the Hadoop-LZO jar?
Thanks
On Mar 17, 2015, at 2:36 AM, 唯我者 878223...@qq.com wrote:
Hi everybody,
I have configured the environment for LZO like this (screenshots attached):
But when I execute code with spark-shell
Hi Reza,
Behavior:
· I tried running the job with different thresholds - 0.1, 0.5, 5, 20,
100. Every time, the job got stuck at mapPartitionsWithIndex at
RowMatrix.scala:522
(http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0)
with all
Hi,
I am trying to evaluate performance aspects of Spark with respect to
various memory settings. What makes it more difficult is that I'm new to
Python, but the problem at hand doesn't seem to originate from that.
I'm running a wordcount script [1] with different amounts of input data.
There
There might be some delay:
http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view
On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Thanks, Ted. Well, so far even there
To get these metrics out, you need to open the driver UI running on port
4040. In there you will see the Stages information, and for each stage you
can see how much time it is spending on GC etc. In your case, the
parallelism seems to be 4; the higher the parallelism, the more tasks you will
see.
How are you running the application? Can you try running the same inside
spark-shell?
Thanks
Best Regards
On Wed, Mar 18, 2015 at 10:51 PM, sprookie cug12...@gmail.com wrote:
Hi All,
I am using Spark version 1.2 running locally. When I try to read a parquet
file I get the below exception, what
Yes, I would suggest spark-notebook too.
It's very simple to set up and it's growing pretty fast.
Paolo
Sent from my Windows Phone
From: Irfan Ahmad <ir...@cloudphysics.com>
Sent: 19/03/2015 04:05
To: davidh <dav...@annaisystems.com>
Cc:
I have a job that sorts data and runs a combineByKey operation and it
sometimes fails with the following error. The job is running on a Spark 1.2.0
cluster with yarn-client deployment mode. Any clues on how to debug the
error?
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
Hi,
I’m struggling to think of the best way to read a text file into an RDD[Char]
rather than an RDD[String].
I can do:
sc.textFile(….), which gives me an RDD[String].
Can anyone suggest the most efficient way to create the RDD[Char]? I’m sure
I’ve missed something simple…
Regards,
Mike
Hi,
I'm trying to vectorize, on a YARN cluster, a corpus of texts (about 500K texts in 13
files - 100GB in total) located in HDFS.
This process has already taken about 20 hours on a 3-node cluster with 6 cores and
20GB RAM on each node.
In my opinion it's too long :-)
I started the task with the following command:
Heszak,
I have only glanced at it, but you should be able to incorporate tokens
approximating n-grams yourself, say by using the Lucene
ShingleAnalyzerWrapper API
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapper.html
You might also take a
Not sure about your intention, but something like SELECT sum(val1), sum(val2)
FROM table GROUP BY src, dst?
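A sketch of the same idea (assuming the data has been registered as a table named "logs" with the columns from your mail):

  // Sum val1 and val2 for every unique (src, dst) pair.
  val sums = sqlContext.sql(
    """SELECT src, dst, SUM(val1) AS sum_val1, SUM(val2) AS sum_val2
      |FROM logs
      |GROUP BY src, dst""".stripMargin)
  sums.show()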
-Original Message-
From: Shailesh Birari [mailto:sbirar...@gmail.com]
Sent: Friday, March 20, 2015 9:31 AM
To: user@spark.apache.org
Subject: Spark SQL Self join with agreegate
jeanlyn92:
I was not very clear in my previous reply: I meant to refer to
/home/hadoop/mylib/hadoop-lzo-SNAPSHOT.jar
But it looks like the distro includes hadoop-lzo-0.4.15.jar
Cheers
On Thu, Mar 19, 2015 at 6:26 PM, jeanlyn92 jeanly...@gmail.com wrote:
That's not enough. The config must point to
Hello,
I want to use Spark SQL to aggregate some columns of the data.
E.g. I have huge data with columns such as:
time, src, dst, val1, val2
I want to calculate sum(val1) and sum(val2) for all unique pairs of src and
dst.
I tried by forming SQL query
SELECT a.time, a.src, a.dst,
Thanks Reza. It makes perfect sense.
Regards,
Manish
From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, March 19, 2015 11:58 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM
Hi Manish,
With 56431 columns, the output can be as large as 56431
Hi ,
Just 2 follow-up questions, please suggest:
1. Is there any commercial recommendation engine, apart from the available open source
tools (Mahout, Spark), that anybody can suggest?
2. In this case only the purchase transaction is captured. There are no
ratings and no feedback
Thanks Derrick, when I count the unique terms it is very small. So I added
this...
val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2)).distinct().count().toInt
val hashingTF = new HashingTF(tfidf_features)
I'm trying to cluster short text messages using KMeans. After training the
KMeans model I want to get the top terms (5 - 10). How do I get that using
clusterCenters?
full code is here
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
Regarding the first question, can you say more about how you are loading your
data? And what is the size of the data set? And is that the only error you see,
and do you only see it in the streaming version?
For the second question, there are a couple of reasons the weights might slightly
differ,
Hi Doug,
I did try setting that config parameter to a larger number (several
minutes), but still wasn't able to retrieve additional context logs. Let us
know if you have any success with it.
Thanks,
Bharath
On Fri, Mar 20, 2015 at 3:21 AM, Doug Balog doug.sparku...@dugos.com
wrote:
I’m seeing
I am trying to debug a Spark Application on a cluster using a master and
several worker nodes. I have been successful at setting up the master node
and worker nodes using the Spark standalone cluster manager. I downloaded the
Spark folder with binaries and used the following commands to set up the worker
kk - I'll put something together and get back to you with more :-)
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com
No, Spark is cross-built for 2.11 too, and those are the deps being
pulled in here. This really does however sound like a Scala 2.10 vs
2.11 mismatch. Check that, for example, your cluster is using the same
build of Spark and that you did not package Spark with your app
On Thu, Mar 19, 2015 at
Hi
Spark 1.2.1 uses Scala 2.10. Because of this, your program fails with Scala
2.11.
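For example, a build.sbt matching the Scala version Spark 1.2.1 is published against could look like this (a sketch; adjust versions to whatever your cluster actually runs):

  name := "SparkEpiFast"

  version := "1.0"

  scalaVersion := "2.10.4"

  // %% appends the Scala binary version, so these resolve to the _2.10 artifacts.
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"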
Regards
On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
My current simple.sbt is
name := "SparkEpiFast"
version := "1.0"
scalaVersion := "2.11.4"
libraryDependencies +=
Can you add your code snippet? Seems it's missing in the original email.
Thanks,
Yin
On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka ken...@gmail.com wrote:
I'm trying to filter a DataFrame by a date column, with no luck so far.
Here's what I'm doing:
When I run reqs_day.count() I get zero,
Nabble is a third-party site that tries its best to archive mail sent out
over the list. Nothing guarantees it will be in sync with the real mailing
list.
To get the truth on what was sent over this Apache-managed list, you
unfortunately need to go to the Apache archives:
val s = sc.parallelize(Array("foo", "bar", "baz"))
val c = s.flatMap(_.toIterator)
c.collect()
res8: Array[Char] = Array(f, o, o, b, a, r, b, a, z)
On Thu, Mar 19, 2015 at 8:46 AM, Michael Lewis lewi...@me.com wrote:
Hi,
I’m struggling to think of the best way to read a text file into an RDD[Char]
If I read the screenshot correctly, the Hadoop LZO jar is under /home/hadoop/mylib
Cheers
On Mar 19, 2015, at 5:37 AM, jeanlyn92 jeanly...@gmail.com wrote:
You should configure as follows:
export
SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native:$HADOOP_HOME/share/hadoop/common/lib/hadoop-lzo-0.4.15.jar
I read the Spark code a little bit, trying to understand my own question.
It looks like the difference is really between
org.apache.spark.serializer.JavaSerializer and
org.apache.spark.serializer.KryoSerializer, both having a method named
writeObject.
In my test case, for each line of my text
Hi Yin,
thanks a lot for that! Will give it a shot and let you know.
On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote:
Was the OOM thrown during the execution of first stage (map) or the second
stage (reduce)? If it was the second stage, can you increase the value
of
Hi,
I am encountering the following error with a Spark application.
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure:
Serialized task 0:0 was 11257268 bytes, which exceeds max allowed:
spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes).
I am trying to understand how to load balance the incoming data to multiple
spark streaming workers. Could somebody help me understand how I can
distribute my incoming data from various sources such that incoming data is
going to multiple spark streaming nodes? Is it done by spark client with
help
hi all - thx for the alacritous replies! so regarding how to get things from
notebook to spark and back, am I correct that spark-submit is the way to go?
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com
Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn.
When a single row is dense, that can end up overwhelming a machine. You can
push that up with more RAM, but note that DIMSUM is meant for tall and
skinny matrices: so it scales linearly and across cluster with rows,
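For reference, a minimal sketch of DIMSUM with a similarity threshold (the input vectors here are illustrative; rows would normally come from your real data):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 3.0),
    Vectors.dense(0.0, 2.0, 1.0)))

  val mat = new RowMatrix(rows)
  // A higher threshold prunes more low-similarity column pairs, shrinking the output.
  val similarities = mat.columnSimilarities(0.5)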
I prefer using search-hadoop.com which provides better search capability.
Cheers
On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Nabble is a third-party site that tries its best to archive mail sent out
over the list. Nothing guarantees it will be in sync
Hello All,
I'm using Spark for streaming but I'm unclear on which implementation
language to use: Java, Scala or Python.
I don't know anything about Python, am familiar with Scala, and have been doing
Java for a long time.
I think the above shouldn't influence my decision on which language to use
Sure, you can use Nabble or search-hadoop or whatever you prefer.
My point is just that the source of truth are the Apache archives, and
these other sites may or may not be in sync with that truth.
On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:
I prefer using
Yes, that is mostly why these third-party sites have sprung up around the
official archives--to provide better search. Did you try the link Ted
posted?
On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
It seems that those archives are not necessarily easy to
It seems that those archives are not necessarily easy to find stuff in. Is
there a search engine on top of them, so as to find e.g. your own posts
easily?
On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Sure, you can use Nabble or search-hadoop or whatever
Interesting points. Yes I just tried
http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view
and I see responses there now. I believe Ted was right in that there's a
delay before they show up there (probably
Here is the reason why results on the search site may be delayed, especially
for Apache JIRAs:
if they crawl too often, Apache would flag the bot and blacklist it.
Cheers
On Thu, Mar 19, 2015 at 7:59 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Interesting points. Yes I just tried
Try writing this Spark Streaming idiom in Java and you'll choose Scala soon
enough:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition( partition => ... )
}
When deciding between Java and Scala for Spark, IMHO Scala has the
upper hand. If you're concerned with readability, have a look at the Scala
Is it possible that `spark.local.dir` is overridden by other settings? The docs say:
NOTE: In Spark 1.0 and later this will be overriden by
SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN)
On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com wrote:
Hi Sean,
Thank very much for
the options of spark-submit should come before main.py, or they will
become the options of main.py, so it should be:
../hadoop/spark-install/bin/spark-submit --py-files
/home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py
--master spark://spark-m:7077 main.py
I'm trying to filter a DataFrame by a date column, with no luck so far.
Here's what I'm doing:
When I run reqs_day.count() I get zero, apparently because my date parameter
gets translated to 16509.
Is this a bug, or am I doing it wrong?
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl kevin.d...@gmail.com wrote:
kevindahl wrote
I'm trying to create a spark data frame from a pandas data frame, but for
even the most trivial of datasets I get an error along the lines of this:
We are building a wrapper that makes it possible to use reactive streams
(i.e. Observable, see reactivex.io) as input to Spark Streaming. We
therefore tried to create a custom receiver for Spark. However, the
Observable lives at the driver program and is generally not serializable.
Is it possible
You could submit additional Python source via --py-files , for example:
$ bin/spark-submit --py-files work.py main.py
On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez guilla...@databerries.com wrote:
Hello guys,
I am having a hard time to understand how spark-submit behave with multiple
files. I
My current simple.sbt is
name := "SparkEpiFast"
version := "1.0"
scalaVersion := "2.11.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1" % "provided"
When I do sbt package, it compiles successfully.
IIRC you have to set that configuration on the Worker processes (for
standalone). The app can't override it (only for a client-mode
driver). YARN has a similar configuration, but I don't know the name
(shouldn't be hard to find, though).
On Thu, Mar 19, 2015 at 11:56 AM, Davies Liu
For YARN, possibly this one ?
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/hadoop/yarn/local</value>
</property>
Cheers
On Thu, Mar 19, 2015 at 2:21 PM, Marcelo Vanzin van...@cloudera.com wrote:
IIRC you have to set that configuration on the Worker processes (for
I’m seeing the same problem.
I’ve set logging to DEBUG, and I think some hints are in the “Yarn AM launch
context” that is printed out
before Yarn runs java.
My next step is to talk to the admins and get them to set
yarn.nodemanager.delete.debug-delay-sec
in the config, as recommended in
Many thanks all for the good responses, appreciated.
On Thu, Mar 19, 2015 at 8:36 AM, James King jakwebin...@gmail.com wrote:
Thanks Khanderao.
On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail
khanderao.k...@gmail.com wrote:
I have used various version of spark (1.0, 1.2.1) without
I meant table properties and serde properties are used to store metadata of
a Spark SQL data source table. We do not set other fields like SerDe lib.
For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source
table should not show unrelated stuff like Serde lib and InputFormat. I
have
I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to
1.3.0, so I changed my `build.sbt` like so:
-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
then make an
Hi all,
DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
schema _and_ storage format in the Hive metastore, so that the table
cannot be read from inside Hive. Spark itself can read the table, but
Hive throws a
Thanks for the assistance. I found the error; it was something I had done;
PEBCAK. I had placed a version of elasticsearch-hadoop-2.1.0.BETA3 in
the project/lib directory, causing it to be an unmanaged dependency that was
brought in first, even though the build.sbt had the correct version
Hi Christian,
Your table is stored correctly in Parquet format.
For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
data source table (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
We are only using Hive's metastore to store the metadata (to
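If a table that Hive itself can read is needed, one commonly suggested workaround (a sketch, not from this thread; it assumes a HiveContext and Parquet support in your Hive version, and the table names are hypothetical) is to create the table through HiveQL instead of saveAsTable:

  // Register the DataFrame and let Hive create a table in a format it understands.
  df.registerTempTable("df_tmp")
  hiveContext.sql(
    "CREATE TABLE my_hive_table STORED AS PARQUET AS SELECT * FROM df_tmp")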
JAVA_HOME, an environment variable, should be defined on the node where
appattempt_1420225286501_4699_02 ran.
Cheers
On Thu, Mar 19, 2015 at 8:59 AM, Williams, Ken ken.willi...@windlogics.com
wrote:
I’m trying to upgrade a Spark project, written in Scala, from Spark
1.2.1 to 1.3.0, so I
Many thanks Gerard, this is very helpful. Cheers!
On Thu, Mar 19, 2015 at 4:02 PM, Gerard Maas gerard.m...@gmail.com wrote:
Try writing this Spark Streaming idiom in Java and you'll choose Scala
soon enough:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition( partition => ... )
}
When
Hello James,
I've been working with Spark Streaming for the last 6 months, and I'm
coding in Java 7. Even though I haven't encountered any blocking issues
with that combination, I'd definitely pick Scala if the decision was up to
me.
I agree with Gerard and Charles on this one. If you can, go
Scala is the language used to write Spark, so there's never a situation in
which features introduced in a newer version of Spark cannot be taken
advantage of if you write your code in Scala. (This is mostly true of Java,
but it may take a little more legwork if a Java-friendly adapter isn't
available
I second what has been said already.
We just built a streaming app in Java and I would definitely choose Scala
this time.
Regards,
Jeff
2015-03-19 16:34 GMT+01:00 Emre Sevinc emre.sev...@gmail.com:
Hello James,
I've been working with Spark Streaming for the last 6 months, and I'm
coding in
Was the OOM thrown during the execution of the first stage (map) or the second
stage (reduce)? If it was the second stage, can you increase the value
of spark.sql.shuffle.partitions and see if the OOM disappears?
This setting controls the number of reducers Spark SQL will use; the
default is 200.
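A quick way to try this from the shell (a sketch; the 800 here is just an illustration, pick a value that suits your data volume):

  // Raise the number of reduce-side partitions for Spark SQL shuffles (default 200).
  sqlContext.setConf("spark.sql.shuffle.partitions", "800")
  // Equivalently, via a SQL statement:
  sqlContext.sql("SET spark.sql.shuffle.partitions=800")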
Hi Yin,
Thanks for the clarification. My first reaction is that if this is the
intended behavior, it is a wasted opportunity. Why create a managed
table in Hive that cannot be read from inside Hive? I think I
understand now that you are essentially piggybacking on Hive's
metastore to persist
From: Ted Yu <yuzhih...@gmail.com>
Date: Thursday, March 19, 2015 at 11:05 AM
JAVA_HOME, an environment variable, should be defined on the node where
appattempt_1420225286501_4699_02 ran.
Has this behavior changed in 1.3.0 since 1.2.1 though? Using 1.2.1 and