Hi,
Kindly subscribe me to the user group.
Regards,
Sathyanarayanan
Hi Buntu,
You could do something similar to the following:
val receiver_stream = new ReceiverInputDStream(ssc) {
  override def getReceiver(): Receiver[Nothing] = ??? // Whatever
}.map((x: String) => (null, x))
val config = new Configuration()
config.set("mongo.output.uri",
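In case it helps, here is a rough sketch of how the rest might look with the mongo-hadoop connector (the output URI, dummy path and Object types below are placeholder assumptions on my side, not from your setup):

import com.mongodb.hadoop.MongoOutputFormat

// Placeholder URI -- point this at your own database.collection.
config.set("mongo.output.uri", "mongodb://localhost:27017/db.collection")

receiver_stream.foreachRDD { rdd =>
  // MongoOutputFormat ignores the path argument, so any dummy path will do.
  rdd.saveAsNewAPIHadoopFile(
    "file:///tmp/unused",
    classOf[Object],
    classOf[Object],
    classOf[MongoOutputFormat[Object, Object]],
    config)
}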
For me inference is not an issue as compared to persistence.
Imagine a Streaming application where the input is JSON whose format can
vary from row to row and whose format I cannot pre-determine.
I can use `sqlContext.jsonRDD`, but once I have the `SchemaRDD`, there is
no way for me to update the
Hi,
I am no expert, but my best guess is that it's a 'closure' problem. Spark
internally creates a closure over all the variables outside the map
function's scope that are used inside it, and runs a serialization check
on that closure before shipping the task. Since class scala.util.Random is not serializable, it
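(A common workaround, sketched here on the assumption that the Random is only needed inside the map: construct it on the executors, for example once per partition, so nothing non-serializable is captured in the closure.)

import scala.util.Random

// Hypothetical input RDD; the Random is created inside mapPartitions on each
// executor, so it is never serialized as part of the task closure.
val data = sc.parallelize(1 to 1000)
val withNoise = data.mapPartitions { iter =>
  val rng = new Random()
  iter.map(x => (x, rng.nextDouble()))
}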
Hi, ALL
I have an RDD of case class T, and T contains several primitive types and a Map.
How can I convert this to a SchemaRDD?
Best Regards,
Kevin.
Hi there,
we have a small Spark cluster running and are processing around 40 GB of
Gzip-compressed JSON data per day. I have written a couple of word count-like
Scala jobs that essentially pull in all the data, do some joins, group bys and
aggregations. A job takes around 40 minutes to
I'm no expert, but I looked into how the Python bits work a while back (I was
trying to assess what it would take to add F# support). It seems Python hosts a
JVM inside of it, and talks to Scala Spark in that JVM. The Python server bit
translates the Python calls to those in the JVM. The Python
What version of Spark are you running? Some recent changes to how PySpark
works relative to Scala Spark may explain things
(https://spark.apache.org/releases/spark-release-1-1-0.html).
PySpark should not be that much slower, not by a stretch.
On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab
We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not
that...
On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com wrote:
What version of Spark are you running? Some recent changes to how PySpark
works relative to Scala Spark may explain things.
You needn't do anything, the implicit conversion should do this for you.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L103
Thanks Matt,
Unlike the feared RDD operations on the driver, it's my understanding that
these Dstream ops on the driver are merely creating an execution plan for
each RDD.
My question still remains: Is it better to foreachRDD early in the process,
or to do as many DStream transformations as possible before going
Total guess without knowing anything about your code: Do either of these
two notes from the 1.1.0 release notes
(http://spark.apache.org/releases/spark-release-1-1-0.html) affect things
at all?
- PySpark now performs external spilling during aggregations. Old
behavior can be restored by
The “output” variable is actually a SchemaRDD; it provides lots of DSL APIs, see
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
1) How to save result values of a query into a list ?
[CH:] val list: Array[Row] = output.collect(), however getting 1M records into
Hi,
Yes, I can always reproduce the issue.
Regarding my workload, Spark configuration, JDK version and OS version:
I ran SparkPi 1000
java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
cat
PS: Just to clarify my statement:
Unlike the feared RDD operations on the driver, it's my understanding
that these Dstream ops on the driver are merely creating an execution plan
for each RDD.
By "feared RDD operations on the driver" I meant to contrast an RDD
action like rdd.collect that would
Thanks a lot, I will try to reproduce this in my local settings and dig into
the details, thanks for your information.
BR
Jerry
From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com]
Sent: Wednesday, October 22, 2014 8:35 PM
To: Shao, Saisai
Cc: arthur.hk.c...@gmail.com; user
Hi,
FYI, I use snappy-java-1.0.4.1.jar
Regards
Arthur
On 22 Oct, 2014, at 8:59 pm, Shao, Saisai saisai.s...@intel.com wrote:
Thanks a lot, I will try to reproduce this in my local settings and dig into
the details, thanks for your information.
BR
Jerry
From:
Interesting thread Marius,
Btw, I'm curious about your cluster size.
How small is it in terms of RAM and cores?
Arian
2014-10-22 13:17 GMT+01:00 Nicholas Chammas nicholas.cham...@gmail.com:
Total guess without knowing anything about your code: Do either of these
two notes from the 1.1.0
I have a PairRDD of type <String, String> which I persist to S3 (using the
following code).
JavaPairRDD<Text, Text> aRDDWritable = aRDD.mapToPair(new ConvertToWritableTypes());
aRDDWritable.saveAsHadoopFile(outputFile, Text.class, Text.class,
SequenceFileOutputFormat.class);
class
You can enable RDD compression (spark.rdd.compress); also, you can
use MEMORY_ONLY_SER
(sc.sequenceFile[String,String]("s3n://somebucket/part-0").persist(StorageLevel.MEMORY_ONLY_SER))
to reduce the RDD size in memory.
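Putting both together, a minimal sketch (the bucket path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Compress serialized partitions held in memory.
val conf = new SparkConf().set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

// Placeholder S3 path; keep the RDD in memory in serialized form.
val rdd = sc.sequenceFile[String, String]("s3n://somebucket/part-0")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)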
Thanks
Best Regards
On Wed, Oct 22, 2014 at 7:51 PM, Darin McBeath
unsubscribe
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
One thing to remember is that Strings are composed of chars in Java,
which take 2 bytes each. The encoding of the text on disk on S3 is
probably something like UTF-8, which takes much closer to 1 byte per
character for English text. This might explain the factor of ~2
difference.
On Wed, Oct 22,
See first section of http://spark.apache.org/community
On Wed, Oct 22, 2014 at 7:42 AM, Margusja mar...@roo.ee wrote:
unsubscribe
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail:
Didn’t seem to help:
conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12")
sc = SparkContext(appName="app_name", conf=conf)
but it's still taking as much time
On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Total guess without knowing
Hello,
I would like to parallelize my work on multiple RDDs I have. I wanted
to know if Spark can support a foreach on an RDD of RDDs. Here's a
Java example:
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("testapp");
No, there's no such thing as an RDD of RDDs in Spark.
Here though, why not just operate on an RDD of Lists? or a List of RDDs?
Usually one of these two is the right approach whenever you feel
inclined to operate on an RDD of RDDs.
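For example, a quick sketch in Scala with made-up data:

// A List of RDDs lives on the driver; each element is an ordinary RDD.
val rdds = List(sc.parallelize(1 to 3), sc.parallelize(4 to 6))
rdds.foreach(rdd => println(rdd.reduce(_ + _)))

// Alternatively, an RDD of Lists keeps everything inside a single RDD.
val rddOfLists = sc.parallelize(Seq(List(1, 2, 3), List(4, 5, 6)))
val sums = rddOfLists.map(_.sum).collect()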
On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini
Wild guess maybe, but do you decode the JSON records in Python? It could
be much slower, as the default lib is quite slow.
If so try ujson [1] - a C implementation that is at least an order of
magnitude faster.
HTH
[1] https://pypi.python.org/pypi/ujson
2014-10-22 16:51 GMT+02:00 Marius
Spark, in general, is good for iterating through an entire dataset again
and again. All operations are expressed in terms of iteration through all
the records of at least one partition. You may want to look at IndexedRDD (
https://issues.apache.org/jira/browse/SPARK-2365) that aims to improve
Hi Michael Campbell,
Are you deploying against YARN or standalone mode? In YARN, try setting the
shell variable SPARK_EXECUTOR_MEMORY=2G; in standalone, try setting
SPARK_WORKER_MEMORY=2G.
Cheers,
Holden :)
On Thu, Oct 16, 2014 at 2:22 PM, Michael Campbell
michael.campb...@gmail.com wrote:
You can use --spark-version argument to spark-ec2 to specify a GIT hash
corresponding to the version you want to checkout. If you made changes that
are not in the master repository, you can use --spark-git-repo to specify
the git repository to pull down spark from, which contains the specified
Hi gpatcham,
If you want to save as a sequence file with a custom compression type you
can use saveAsHadoopFile along with setting the
mapred.output.compression.type on the JobConf. If you want to keep using
saveAsSequenceFile, whose syntax is much nicer, you could also set that
property on
This works for me
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v1=Vectors.dense(Array(1d,2d))
val v2=Vectors.dense(Array(3d,4d))
val rows=sc.parallelize(List(v1,v2))
val mat=new RowMatrix(rows)
val svd:
In master, you can easily profile your job and find the bottlenecks,
see https://github.com/apache/spark/pull/2556
Could you try it and show the stats?
Davies
On Wed, Oct 22, 2014 at 7:51 AM, Marius Soutier mps@gmail.com wrote:
It’s an AWS cluster that is rather small at the moment, 4
Thank you very much,
two more small questions:
1) val output = sqlContext.sql("SELECT * FROM people")
my output has 128 columns and a single row.
How do I find which column has the maximum value in a single row using
Scala?
2) As each row has 128 columns, how do I print each row into a text
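(For 1), one possible approach, assuming all 128 columns are numeric -- just a sketch that treats the Row as a sequence of values:)

val row = output.first()  // the single Row
// Index and value of the largest column, assuming numeric columns.
val maxIdx = (0 until row.length).maxBy(i => row.getDouble(i))
val maxValue = row.getDouble(maxIdx)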
Hi Saiph,
Patrick McFadin and Helena Edelson from DataStax taught a tutorial at NYC
Strata last week where they created a prototype Spark Streaming + Kafka
application for time series data.
You can see the code here:
https://github.com/killrweather/killrweather
On Tue, Oct 21, 2014 at 4:33 PM,
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:
Wild guess maybe, but do you decode the json records in Python ? it could
be much slower as the default lib is quite slow.
Oh yeah, this is a good place to look. Also, just upgrading to Python 2.7
may be enough
We want to run multiple instances of the Spark SQL CLI on our YARN cluster.
Each instance of the CLI is to be used by a different user. This looks
non-optimal if each user brings up a different CLI, given how Spark works on
YARN by running executor processes (and hence consuming resources) on
worker
Thanks Daniil! If I use --spark-git-repo, is there a way to specify the mvn
command-line parameters? Like the following:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
mvn -Pyarn -Phadoop-2.3 -Phbase-hadoop2 -Dhadoop.version=2.3.0 -DskipTests
clean package
Hi Spark devs/users,
One of the things we are investigating here at Netflix is if Spark would
suit us for our ETL needs, and one of the requirements is multi-tenancy.
I did read the official doc
http://spark.apache.org/docs/latest/job-scheduling.html and the book, but
I'm still not clear on certain
Hi all,
With SPARK-3948, the exception in Snappy PARSING_ERROR is gone, but
I have another exception now. I have no clue about what's going on; has
anyone run into a similar issue? Thanks.
This is the configuration I use.
spark.rdd.compress true
spark.shuffle.consolidateFiles true
Hi Ashwin,
Let me try to answer to the best of my knowledge.
On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
Here are my questions:
1. Sharing spark context: How exactly can multiple users share the cluster
using the same spark context?
That's not
Hi,
How can I join neighbor sliding windows in spark streaming?
Thanks,
Josh
It seems that this issue should be addressed by
https://github.com/apache/spark/pull/2890 ? Am I right?
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Wed, Oct 22, 2014 at 11:54 AM, DB
Or can it be solved by setting both of the following settings to true for now?
spark.shuffle.spill.compress true
spark.shuffle.compress true
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
We've been getting some OOMs from the spark master since upgrading to Spark
1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the
worker heap, which as far as I know is fine. Is there any setting which
*only* increases the master heap size?
Keith
PS, sorry for spamming the mailing list. Based on my knowledge, both
spark.shuffle.spill.compress and spark.shuffle.compress default to
true, so in theory we should not run into this issue if we don't
change any settings. Is there any other bug we are running into?
Thanks.
Sincerely,
DB Tsai
Another approach could be to create artificial keys for each RDD and
convert them to PairRDDs. So your first RDD becomes a
JavaPairRDD<Integer, String> rdd1 with values (1, "1"); (1, "2") and so on.
The second RDD becomes rdd2 with (2, "a"); (2, "b"); (2, "c").
You can union the two RDDs, groupByKey, countByKey etc. and maybe achieve
what
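A rough sketch of that idea in Scala (made-up data; the JavaPairRDD version is analogous):

// Tag each RDD's elements with an artificial key identifying the source RDD.
val rdd1 = sc.parallelize(Seq("1", "2")).map(v => (1, v))
val rdd2 = sc.parallelize(Seq("a", "b", "c")).map(v => (2, v))

// Union them, then group or count by the artificial key.
val unioned = rdd1.union(rdd2)
val grouped = unioned.groupByKey()   // (1, ["1", "2"]), (2, ["a", "b", "c"])
val counts  = unioned.countByKey()   // Map(1 -> 2, 2 -> 3)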
Hi,
I just tried the sample Pi calculation on a Spark cluster. After returning the Pi
result, it shows ERROR ConnectionManager: Corresponding SendingConnection to
ConnectionManagerId(m37,35662) not found
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
spark://m33:7077
Thanks Marcelo, that was helpful! I had some follow-up questions:
That's not something you might want to do usually. In general, a
SparkContext maps to a user application
My question was basically this. In this page
(http://spark.apache.org/docs/latest/job-scheduling.html) in the
official doc,
Hi,
I just wonder if SparkSQL supports Hive built-in functions (e.g.
from_unixtime) or any of the functions pointed out here : (
https://cwiki.apache.org/confluence/display/Hive/Tutorial)
best,
/Shahab
Yeah we’re using Python 2.7.3.
On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote:
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:
Wild guess maybe, but do you decode the json records in Python ? it could be
much slower as the
Can’t install that on our cluster, but I can try locally. Is there a pre-built
binary available?
On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote:
In master, you can easily profile your job and find the bottlenecks,
see https://github.com/apache/spark/pull/2556
Could you try
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
That's not something you might want to do usually. In general, a
SparkContext maps to a user application
My question was basically this. In this page in the official doc, under
Scheduling within an application
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote:
No, there's no such thing as an RDD of RDDs in Spark.
Here though, why not just operate on an RDD of Lists? or a List of RDDs?
Usually one of these two is the right approach whenever you feel
inclined to operate on an
Yes, when using a HiveContext.
On Wed, Oct 22, 2014 at 2:20 PM, shahab shahab.mok...@gmail.com wrote:
Hi,
I just wonder if SparkSQL supports Hive built-in functions (e.g.
from_unixtime) or any of the functions pointed out here : (
https://cwiki.apache.org/confluence/display/Hive/Tutorial)
Sorry, there is not; you can try cloning it from GitHub and building it from
scratch, see [1]
[1] https://github.com/apache/spark
Davies
On Wed, Oct 22, 2014 at 2:31 PM, Marius Soutier mps@gmail.com wrote:
Can’t install that on our cluster, but I can try locally. Is there a
pre-built binary
The JDBC server is what you are looking for:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
On Wed, Oct 22, 2014 at 11:10 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
We want to run multiple instances of spark sql cli on our yarn cluster.
Each
Hi Keith,
Would be helpful if you could post the error message.
Are you running Spark in Standalone mode or with YARN?
In general, the Spark Master is only used for scheduling and it should be
fine with the default setting of 512 MB RAM.
Is it actually the Spark Driver's memory that you
The implicit conversion function mentioned by Hao is createSchemaRDD in
SQLContext/HiveContext.
You can import it by doing
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Or new org.apache.spark.sql.hive.HiveContext(sc) for HiveContext
import sqlContext.createSchemaRDD
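For example, a minimal sketch with a hypothetical case class (covering the Map field from the original question, assuming the reflection-based schema inference handles Map fields in your version):

// Hypothetical shape for T: a couple of primitives plus a Map.
case class T(id: Int, name: String, props: Map[String, String])

val rdd = sc.parallelize(Seq(T(1, "a", Map("k" -> "v"))))
val schemaRDD: org.apache.spark.sql.SchemaRDD = rdd  // createSchemaRDD applied implicitly
schemaRDD.registerTempTable("t")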
On Wed, Oct
Hi,
I'm wondering how to use Mllib for solving equation systems following this
pattern
2*x1 + x2 + 3*x3 + ... + xn = 0
x1 + 0*x2 + 3*x3 + ... + xn = 0
..
..
0*x1 + x2 + 0*x3 + ... + xn = 0
I definitely still have some reading to do to really understand the direct
solving
Hi Martin,
This problem is Ax = b, where A is your matrix [2 1 3 ... 1; 1 0 3 ...;]
and x is what you want to find. b is 0 in this case. For MLlib, b is normally
the label: basically create a LabeledPoint where the label is always 0.
Use MLlib's linear regression and solve the following
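Something along these lines (a sketch with made-up coefficients, not taken from the original problem):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Each row of A becomes the features of a LabeledPoint; the label is the
// right-hand side, which is 0 for every equation here.
val equations = Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 0.0)))
val data = sc.parallelize(equations)

// The fitted weights are an approximate (least-squares) solution x.
val model = LinearRegressionWithSGD.train(data, 100)
println(model.weights)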
I modified the pom files in my private repo to use those parameters as
default to solve the problem. But after the deployment, I found the
installed version is not the customized version, but an official one. Could anyone
please give a hint on how spark-ec2 works with Spark from private repos?
You may be running into this issue:
https://issues.apache.org/jira/browse/SPARK-4019
You could check by having 2000 or fewer reduce partitions.
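For example (a sketch; pairs is a hypothetical pair RDD and the operation is illustrative):

// Cap the number of reduce partitions at 2000 to stay under the threshold
// where the issue in SPARK-4019 can appear.
val counts = pairs.reduceByKey(_ + _, 2000)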
On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai dbt...@dbtsai.com wrote:
PS, sorry for spamming the mailing list. Based my knowledge, both
On a related note, how are you submitting your job?
I have a simple streaming proof of concept and noticed that everything runs
on my master. I wonder if I do not have enough load for spark to push tasks
to the slaves.
Thanks
Andy
From: Daniel Mahler dmah...@gmail.com
Date: Monday, October
Hi,
I have managed to resolve it; it was caused by a wrong setting. Please ignore this.
Regards
Arthur
On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up
Hi,
I got java.lang.NullPointerException. Please help!
sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity,
l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit
10").collect().foreach(println);
2014-10-23 08:20:12,024 INFO
I'm trying to submit a simple test code through spark-submit. The first portion
of the code works fine, but some calls to the breeze vector library fail:
14/10/22 17:36:02 INFO CacheManager: Partition rdd_1_0 not found, computing
it
14/10/22 17:36:02 ERROR Executor: Exception in task 0.0 in stage 0.0
Hi Yang,
It looks like your build file specifies a different version than the version of
Spark you are running against. I'd try building against the same version of
Spark as you are running your application against (1.1.0). Also, what is
your assembly/shading configuration for your build?
Cheers,
Another wild guess: if your data is stored in S3, you might be running into
an issue where the default jets3t properties limit the number of parallel
S3 connections to 4. Consider increasing the max thread counts from here:
http://www.jets3t.org/toolkit/configuration.html.
On Tue, Oct 21, 2014
Hi
I have a JSON file that can be loaded by sqlContext.jsonFile into a
table, but this table is not partitioned.
Then I wish to transform this table into a partitioned table, say on the
field “date” etc. What will be the best approach to do this? It seems in Hive
this is usually
Hi,
Please find the attached file.
Hello Experts,
I created a table using spark-sql CLI. No Hive is installed. I am using
spark 1.1.0.
create table date_test(my_date timestamp)
row format delimited
fields terminated by ' '
lines terminated by '\n'
LOCATION '/user/hive/date_test';
The data file has following data:
2014-12-11
Hi
May I know where to configure Spark to load libhadoop.so?
Regards
Arthur
On 23 Oct, 2014, at 11:31 am, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
Please find the attached file.
lsof.rtf
my spark-default.xml
# Default system properties included when running
Hi all,
I am new to spark/graphx and am trying to use partitioning strategies in
graphx on spark 1.0.0
The workaround I saw on the main page seems not to compile.
The code I added was
def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy:
PartitionStrategy): RDD[Edge[ED]] = {
val
Thanks!
On Thu, Oct 23, 2014 at 10:56 AM, Akshat Aranya aara...@gmail.com wrote:
Yes, that is a downside of Spark's design in general. The only way to
share data across consumers of data is by having a separate entity that
owns the Spark context. That's the idea behind Ooyala's job server.
It seems you just need to add the snappy library into your classpath:
export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
But Spark itself depends on snappy-0.2.jar. Is there any possibility
that this problem is caused by a different version of snappy?
Thanks
Jerry
From:
I started building Spark / running Spark tests this weekend and on maybe
5-10 occasions have run into a compiler crash while compiling
DataTypeConversions.scala.
Here is a full gist
(https://gist.github.com/ryan-williams/7673d7da928570907f4d) of an innocuous
test command (mvn test
Hi, please take a look at the attached screenshot. I wonder what the
Memory Used column means.
I give 2GB memory to the driver process and 12GB memory to the executor
process.
Thank you!
During tests, I often modify my code a little bit and want to see the
result.
But spark-submit requires the full fat jar, which takes quite a lot of time
to build.
I just need to run in --master local mode. Is there a way to run it without
rebuilding the fat jar?
thanks
Yang
We have narrowed this hanging issue down to the calliope package
that we used to create an RDD from reading a Cassandra table.
The calliope native RDD interface seems to hang, and I have decided to switch
to the calliope cql3 RDD interface.
I think you can give a list of jars - not just one - to spark-submit, so
build only the one that has changed source code.
On Wed, Oct 22, 2014 at 10:29 PM, Yang tedd...@gmail.com wrote:
during tests, I often modify my code a little bit and want to see the
result.
but spark-submit