There isn't a conf/spark-defaults.conf file in the .tgz. There's a template
file, but we didn't think we'd need one. I assumed we'd use the defaults, and
anything we wanted to override would be in the properties file we load via
--properties-file, as well as command-line params (--master etc.).
Not quite sure, but try pointing spark.history.fs.logDirectory to your
S3, for example:
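A minimal sketch, assuming the history server has the S3 filesystem jars and
credentials on its classpath (the bucket name is made up):

# conf/spark-defaults.conf (illustrative values)
spark.eventLog.enabled           true
spark.eventLog.dir               s3n://my-spark-logs/events
spark.history.fs.logDirectory    s3n://my-spark-logs/events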
Thanks
Best Regards
On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera
gianluca.privite...@studio.unibo.it wrote:
On the Spark website it’s stated in the View After the Fact section (
On the master node, I see this printed over and over in the
mesos-master.WARNING log file:
W0615 06:06:51.211262 8672 hierarchical_allocator_process.hpp:589] Using
the default value of 'refuse_seconds' to create the refused resources
filter because the input value is negative
Here's what I see
On the Spark website it’s stated in the View After the Fact section
(https://spark.apache.org/docs/latest/monitoring.html) that you can point the
start-history-server.sh script to a directory in order to view the Web UI using
the logs as the data source.
Is it possible to point that script to S3?
Thank you for the answer; it doesn't seem to work either (I haven't logged
into the machine as the spark user, but run kinit inside the spark-env script),
and I also tried it inside the job.
I've noticed when I run pyspark that the Kerberos token is used for
something, but this same behavior is not presented
Thanks Akhil for taking this point; I am also talking about the MQ bottleneck.
I currently have 5 receivers for an unreliable WebSphere MQ receiver
implementation.
Is there any proven way to convert this implementation to a reliable one?
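For reference, a minimal sketch of a reliable receiver, assuming a hypothetical
WebSphere MQ client (the fetch/ack methods below are placeholders, not a real
MQ API). The key point is that store(buffer) blocks until Spark has stored the
block, and only then are the messages acknowledged to MQ, so nothing is lost
if the receiver dies mid-batch:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ReliableMQReceiver(queueUrl: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("MQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* close the MQ connection here */ }

  private def receive(): Unit = {
    while (!isStopped) {
      val batch = fetchBatchFromMQ()  // hypothetical: read a block of messages
      store(batch)                    // blocks until Spark has stored the block
      ackBatchToMQ()                  // acknowledge only after store() returns
    }
  }

  // Placeholders standing in for the actual WebSphere MQ client calls.
  private def fetchBatchFromMQ(): ArrayBuffer[String] = ArrayBuffer.empty
  private def ackBatchToMQ(): Unit = ()
}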
Regards,
Umesh Chaudhary
From: Akhil Das
Hi all,
Is there a way to stop the streaming context when some batch processing fails?
I want to set a reasonable retry count, say 10, and if it still fails, stop
the context completely.
Is that possible?
Hi All,
In my use case an HDFS file is the source for Spark Streaming;
the job will process the data line by line, but how do I make sure to
maintain the offset (line number of data already processed) across a restart
or a new code push?
Team, can you please reply on this: is there any configuration in Spark?
https://spark.apache.org/docs/latest/monitoring.html
also subscribe to various Listeners for various Metrics Types, e.g. Job
Stats/Statuses; this will allow you (in the driver) to decide when to stop
the context gracefully (the listening and stopping can be done from a
completely
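A minimal sketch of that idea, assuming ssc is your StreamingContext and sc
its SparkContext; the failure-counting policy and the retry budget of 10 are
illustrative, not a Spark feature:

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

val failures = new AtomicInteger(0)
sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    if (jobEnd.jobResult == JobSucceeded) failures.set(0)
    else if (failures.incrementAndGet() >= 10) {
      // stop from a separate thread, never from the listener callback itself
      new Thread("streaming-stopper") {
        override def run(): Unit =
          ssc.stop(stopSparkContext = true, stopGracefully = true)
      }.start()
    }
  }
})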
a skew join (where the dominant key is spread across multiple executors) is
pretty standard in other frameworks; see for example in Scalding:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala
this would be a great addition to
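Until something like it lands in Spark, the usual workaround is manual key
salting; a rough sketch, assuming two pair RDDs where the big side's keys are
heavily skewed (the data and the replication factor N are made up):

import scala.util.Random

val bigRdd   = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))
val smallRdd = sc.parallelize(Seq(("hot", "x"), ("cold", "y")))

val N = 8  // salt buckets: spreads each hot key over N reducers
val saltedBig   = bigRdd.map { case (k, v) => ((k, Random.nextInt(N)), v) }
// replicate each small-side row N times so every salted key finds its match
val saltedSmall = smallRdd.flatMap { case (k, v) => (0 until N).map(i => ((k, i), v)) }
val joined = saltedBig.join(saltedSmall).map { case ((k, _), vs) => (k, vs) }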
Hi Josh
It was great meeting you in person at the Spark Summit SFO yesterday.
Thanks for discussing potential solutions to the problem.
I verified that 2 hive gateway nodes had not been configured correctly. My bad.
I added hive-site.xml to the Spark conf directories for these 2 additional Hive
On Fri, Jun 12, 2015 at 9:43 PM, Michael Armbrust mich...@databricks.com
wrote:
2. Does 1.3.2 or 1.4 have any enhancements that can help? I tried to use
1.3.1 but SPARK-6967 prohibits me from doing so. Now that 1.4 is
available, would any of the JOIN enhancements help this situation?
I
It gives me an exception from org.apache.spark.deploy.history.FsHistoryProvider,
a problem with the file system. I can reproduce the exception if you want.
It works perfectly if I give a local path; I tested it on version 1.3.0.
Gianluca
On 16 Jun 2015, at 15:08, Akhil Das
This is 1.3.1.
Ayman Farahat
--
View my research on my SSRN Author page:
http://ssrn.com/author=1594571
From: Nick Pentreath nick.pentre...@gmail.com
To: user@spark.apache.org user@spark.apache.org
Sent: Tuesday, June 16, 2015 4:23 AM
I realize that there are a lot of ways to configure my application in Spark.
The part that is not clear is how I decide, for example, into how
many partitions I should divide my data, how much RAM I should have, or how
many workers one should initialize.
Hello guys,
I faced a problem where I cannot pass my data inside an RDD partition while
trying to develop a Spark Streaming feature. I'm a newcomer to Spark; could you
please give me any suggestions on this problem?
The figure in the attachment is the code I used in my program:
After I run my
Hi there,
I am looking to use Mockito to mock out some functionality while unit testing a
Spark application.
I currently have code that happily runs on a cluster, but fails when I try to
run unit tests against it, throwing a SparkException:
org.apache.spark.SparkException: Job aborted due to
Best is by measuring and recording how the performance of your solution
scales as the workload scales, recording the data points, and
then you can do some time-series statistical analysis and visualizations.
For example you can start with a single box with e.g. 8 CPU cores.
Use e.g. 1 or
I updated the code sample so people can better understand what my inputs and
outputs are.
Hi Esten,
Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file
path you specified is local?
mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json",
The error below shows that the input path you specified does not exist on the
cluster. Pointing to the right
The error you are running into is that the input file does not exist; you
can see it from the following line:
Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json
Thanks
Shivaram
On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote:
Hi,
In SparkR
hey guys
After day one at the Spark Summit SFO, I realized sadly that (indeed) HDFS is
not supported by Databricks Cloud. My speed bottleneck is to transfer ~1TB of
snapshot HDFS data (250+ external Hive tables) to S3 :-(
I want to use Databricks Cloud but this to me is a starting disabler. The
hey guys
I have CDH 5.3.3 with Spark 1.2.0 (on Yarn)
This does not work:
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql --deploy-mode client --master
yarn --driver-memory 1g -e "select j.person_id, p.first_name, p.last_name,
count(*) from (select person_id from cdr.cdr_mjp_joborder where
+cc user@spark.apache.org
Reply inline.
On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
Sauptik.Dhar wrote:
Hi DB,
Thank you for the reply. That explains a lot.
However, I had a few points regarding this:
1. Just to help with the debate of not regularizing the b parameter. A
Hi Sujit,
That's a good point. But 1-hot encoding will change our data from
terabytes to petabytes, because we have tens of categorical attributes, and
some of them contain thousands of categorical values.
Is there any way to strike a good balance between data size and the right
representation of
Hi Rexx,
In general (i.e. not Spark-specific), it's best to convert categorical data to
1-hot encoding rather than integers; that way the algorithm doesn't use
the ordering implicit in the integer representation.
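For what it's worth, spark.ml in Spark 1.4 ships StringIndexer and
OneHotEncoder for exactly this; a minimal sketch (the data and column names
are made up):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// index the string column first, then expand the index into a 1-hot vector
val df = sqlContext.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "a"), (3, "c")
)).toDF("id", "category")

val indexed = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIndex")
  .fit(df).transform(df)

val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex").setOutputCol("categoryVec")
  .transform(indexed)

The encoded column is a sparse vector, so the 1-hot expansion need not blow
the data up to its dense size.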
-sujit
On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote:
Is it
You could consider using Zeppelin and Spark on YARN as an alternative.
http://zeppelin.incubator.apache.org/
Simon
On 16 Jun 2015, at 17:58, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey guys
After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS
Is it necessary to convert categorical data into integers?
Any tips would be greatly appreciated!
-Rex
On Sun, Jun 14, 2015 at 10:05 AM, Rex X dnsr...@gmail.com wrote:
For clustering analysis, we need a way to measure distances.
When the data contains different levels of measurement -
Hello
I would like to multiply two matrices:
C = A * B
where A is m x k and B is k x l,
with k and l much smaller than m, so that B can easily fit in memory.
Any ideas or suggestions how to do that in Pyspark?
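One common pattern is to broadcast the small matrix and multiply each row of A
against it locally; a rough sketch in Scala with Breeze (which ships with
Spark's MLlib), the same broadcast idea working in PySpark via sc.broadcast
and numpy.dot. Sizes and data are illustrative:

import breeze.linalg.{DenseMatrix, DenseVector}

val m = 100000; val k = 100; val l = 10
val B  = DenseMatrix.rand(k, l)       // small k x l matrix, fits in memory
val bB = sc.broadcast(B)

// A as an RDD of (rowIndex, row) pairs, each row a vector of length k
val aRows = sc.parallelize(0 until m).map(i => (i.toLong, DenseVector.rand(k)))
// each row of C = A * B, computed locally against the broadcast copy of B
val cRows = aRows.mapValues(row => bB.value.t * row)  // rows of length l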
Thanks
Ayman
That's great news. Can I assume Spark on EMR supports a Kinesis-to-HBase
pipeline?
On 17 Jun 2015 05:29, kamatsuoka ken...@gmail.com wrote:
Spark is now officially supported on Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/details/spark/
I'd like to understand better what happens when a streaming consumer job
(with direct streaming, but also with receiver-based streaming) is
killed/terminated/crashes.
Assuming it was processing a batch of RDD data, what happens when the job is
restarted? How much state is maintained within
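For context, the standard recovery pattern is metadata checkpointing plus
StreamingContext.getOrCreate, which rebuilds the context (including unfinished
batches) from the checkpoint directory on restart. A minimal sketch; the path,
source, and batch interval are made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/consumer"  // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("consumer")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
  lines.count().print()
  ssc
}

// on a clean start this calls createContext(); on restart it recovers state
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()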
I'm looking at the doc here:
https://spark.apache.org/docs/latest/monitoring.html.
Is there a way to define custom metrics in Spark, via Coda Hale perhaps, and
emit those?
Can a custom metrics sink be defined?
And, can such a sink collect some metrics, execute some metrics handling
logic, then
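For the built-in part: Spark's metrics system is already Coda Hale based and is
configured through conf/metrics.properties. An illustrative snippet enabling
the bundled CSV sink (period, unit, and directory are made up):

# conf/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

A fully custom sink is trickier: the Sink trait is private[spark], so custom
sinks typically have to live in an org.apache.spark.metrics.sink package
inside your own jar.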
As discussed during the meetup, the following information should help while
creating a topic on the User mailing list.
1) Version of Spark and Hadoop should be included to help reproduce the
issue or understand if the issue is a version limitation
2) Explanation about the scenario in as much
What is Spark's data retention policy?
As in, the jobs that are sent from the master to the worker nodes, how long
do they persist on those nodes? What about the RDD data, how is that
cleaned up? Are all RDDs cleaned up at GC time unless they've been
.persist()'ed or .cache()'ed?
this would be a great addition to Spark, and ideally it belongs in Spark
core, not SQL.
I agree with the fact that this would be a great addition, but we would
likely want a specialized SQL implementation for performance reasons.
Hello,
Is the json file in HDFS or local?
/home/esten/ami/usaf.json is this an HDFS path?
Suggestions:
1) Specify file:/home/esten/ami/usaf.json
2) Or move the usaf.json file into HDFS since the application is looking for
the file in HDFS.
Please let me know if that helps.
Thank you.
--
I would really appreciate if someone could help me with this.
On Monday, June 15, 2015, Mohammad Tariq donta...@gmail.com wrote:
Hello list,
The method *insertIntoJDBC(url: String, table: String, overwrite:
Boolean)* provided by Spark DataFrame allows us to copy a DataFrame into
a JDBC DB
Hi folks,
running into a pretty strange issue -- I have a ClassNotFound exception
from a closure?! My code looks like this:
val jRdd1 = table.map(cassRow => {
  val lst = List(cassRow.get[Option[Any]](0), cassRow.get[Option[Any]](1))
  Row.fromSeq(lst)
})
println(s"This one worked
When all else fails look at the source ;)
Looks like createJDBCTable is deprecated, but otherwise goes to the same
implementation as insertIntoJDBC...
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
You can also look at DataFrameWriter in
Hi All,
I am evaluating Spark vs Storm (Spark Streaming) and I am not able to see
what the equivalent of Storm's Bolt is inside Spark.
Any help will be appreciated on this ?
Thanks ,
Ashish
If you run Spark on YARN, the simplest way is to replace the
$SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run
your application.
The spark-submit script will upload this jar to the YARN cluster automatically,
and then you can run your application as usual.
It does not care about
I have a similar scenario where we need to bring data from Kinesis to
HBase. Data velocity is 20k per 10 mins. A little manipulation of the data will
be required, but that's regardless of the tool, so we will be writing that
piece as Java POJOs.
All env is on aws. Hbase is on a long running EMR and
*Problem Statement:*
While querying a partitioned table using Spark SQL (version 1.4.0), an
access-denied exception is observed on partitions the user doesn't belong
to (user permissions are controlled using HDFS ACLs). The same works
correctly in Hive.
*Use case:* /To address
I have a use case where a stream of incoming events has to be aggregated and
joined to create complex events. The aggregation has to happen at an
interval of 1 minute (or less).
The pipeline is: send events
enrich
To clarify, I am using the spark standalone cluster.
On Tuesday, June 16, 2015, Yanbo Liang yblia...@gmail.com wrote:
If you run Spark on YARN, the simplest way is to replace the
$SPARK_HOME/lib/spark-.jar with your own version of the Spark jar file and run
your application.
The spark-submit
Hi Dhar,
For standardization, we can disable it effectively by using
different regularization on each component. Thus, we're solving the
same problem but with a better rate of convergence. This is one of the
features I will implement.
Sincerely,
DB Tsai
Probably overloading the question a bit.
In Storm, Bolts have the functionality of getting triggered on events. Is
that kind of functionality possible with Spark Streaming? During each phase
of the data processing, the transformed data is stored to the database and
this transformed data should
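For what it's worth, the closest analogue to a Bolt's per-event side effect in
Spark Streaming is a transformation followed by foreachRDD, which fires once
per micro-batch. A rough sketch; the source, the transform, and the database
write are placeholders:

val events = ssc.socketTextStream("localhost", 9999)    // illustrative source
val transformed = events.map(_.toUpperCase)             // illustrative transform
transformed.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open one DB connection per partition here (placeholder)
    def saveToDatabase(record: String): Unit = ()       // placeholder write
    records.foreach(saveToDatabase)
    // close the connection here
  }
}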
Please file a JIRA for it.
On Mon, Jun 15, 2015 at 8:00 AM, mrm ma...@skimlinks.com wrote:
Hi all,
I was looking for an explanation of the number of partitions for a joined
RDD.
The documentation of Spark 1.3.1 says that:
For distributed shuffle operations like reduceByKey and join, the
In general, you should avoid making direct changes to the Spark source code. If
you are using Scala, you can seamlessly blend your own methods on top of the
base RDDs using implicit conversions.
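A minimal sketch of that pattern, adding your own method to every RDD without
touching Spark's source (the method body is deliberately simple; a
treeReduce-based top() would slot in the same way):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object RDDExtensions {
  // implicit class: any RDD[T] in scope gains the myTop method
  implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
    def myTop(n: Int)(implicit ord: Ordering[T]): Array[T] =
      rdd.sortBy(identity, ascending = false).take(n)
  }
}

// usage: import RDDExtensions._ and then myRdd.myTop(5)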
Regards,
Will
On June 16, 2015, at 7:53 PM, raggy raghav0110...@gmail.com wrote:
I am trying to
The programming models for the two frameworks are conceptually rather
different; I haven't worked with Storm for quite some time, but based on my old
experience with it, I would equate Spark Streaming more with Storm's Trident
API, rather than with the raw Bolt API. Even then, there are
I made the change so that I could implement top() using treeReduce(). A member
on here suggested I make the change in RDD.scala to accomplish that. Also, this
is for a research project, and not for commercial use.
So, any advice on how I can get spark-submit to use my custom-built jars
Hi,
I have a Spark Streaming program that has been running for ~25 hrs. When I check
the Streaming UI tab, I found the number of waiting batches is 144, but the
scheduling delay is 0. I am a bit confused.
If the number of waiting batches is 144, does that mean many batches are waiting
in the queue to be processed? If this is the
If this is research-only, and you don't want to have to worry about updating
the jars installed by default on the cluster, you can add your custom Spark jar
using the spark.driver.extraLibraryPath configuration property when running
spark-submit, and then use the experimental
Forgot to mention this is on standalone mode.
Is my configuration wrong?
Thanks,
Liming
On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
I have this in my spark-defaults.conf (same for hdfs):
spark.eventLog.enabled true
spark.eventLog.dir
The documentation says spark.driver.userClassPathFirst can only be used in
cluster mode. Does this mean I have to set the --deploy-mode option for
spark-submit to cluster? Or can I still use the default client? My
understanding is that even in the default deploy mode, spark still uses the
I am trying to submit a spark application using the command line. I used the
spark submit command for doing so. I initially setup my Spark application on
Eclipse and have been making changes on there. I recently obtained my own
version of the Spark source code and added a new method to RDD.scala.
I would suggest looking at
https://github.com/datastax/spark-cassandra-connector
On Tue, Jun 16, 2015 at 4:01 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
hi all!
is there a way to connect cassandra with jdbcRDD ?
I just ran into this too. Thanks for the tip!
Hi,
I am running a simple spark streaming application on hadoop 2.7.0/YARN
(master: yarn-client) with 2 executors on different machines. However,
while the app is running, I can see in the app web UI (Executors tab) that
only 1 executor keeps completing tasks over time; the other executor only
Hi Peng,
I got exactly the same error! My shuffle data is also very large. Have you
figured out a method to solve it?
Thanks,
Jia
On Fri, Apr 24, 2015 at 7:59 AM, Peng Cheng pc...@uow.edu.au wrote:
I'm deploying a Spark data processing job on an EC2 cluster, the job is
small
for the cluster
Hi guys: I added the parameter spark.worker.cleanup.appDataTtl 3 * 24 *
3600 in my conf/spark-defaults.conf, then I started my spark cluster. However I
got an exception:
15/06/16 14:25:14 INFO util.Utils: Successfully started service 'sparkWorker'
on port 43344.
15/06/16 14:25:14 ERROR
I think you have to use 604800 instead of 7 * 24 * 3600; obviously
SparkConf will not do the multiplication for you.
The exception is quite obvious: Caused by: java.lang.NumberFormatException:
For input string: 3 * 24 * 3600
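Concretely, for the 3-day TTL the original post intended, the value must be a
literal number of seconds (3 * 24 * 3600 = 259200):

# conf/spark-defaults.conf -- SparkConf does not evaluate arithmetic
spark.worker.cleanup.enabled     true
spark.worker.cleanup.appDataTtl  259200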
2015-06-16 14:52 GMT+08:00 luohui20...@sina.com:
Hi guys:
I
Hi all,
Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no
problems with Spark 1.3.
When trying to save a data frame with 569610608 rows:
dfc.write.format("parquet").save("/data/map_parquet_file")
we get random results between runs. Caching the data frame in memory
Hi Shreesh,
You can definitely decide how many partitions your data should break
into by passing a 'minPartitions' argument to the method
sc.textFile("input/path", minPartitions) and a 'numSlices' arg to the method
sc.parallelize(localCollection, numSlices). In fact there is always an option
to specify
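For completeness, hedged examples of the common knobs (paths and sizes are
made up):

val lines = sc.textFile("hdfs:///data/input.txt", 64)  // minPartitions
val nums  = sc.parallelize(1 to 1000000, 32)           // numSlices
val fewer = lines.coalesce(16)      // shrink partition count without a shuffle
val more  = lines.repartition(128)  // change partition count with a full shuffle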
Good question; with fileStream or textFileStream, basically it will only
take in the files whose timestamp is the current timestamp.
Did you look inside all logs? Mesos logs and executor logs?
Thanks
Best Regards
On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote:
My Mesos cluster has 1.5 CPU and 17GB free. If I set:
conf.set("spark.mesos.coarse", "true");
conf.set("spark.cores.max", "1");
in the SparkConf
Spark SQL document states:
Tables with buckets: bucket is the hash partitioning within a Hive table
partition. Spark SQL doesn’t support buckets yet.
What exactly does that mean?
- that writing to a bucketed table won't respect this feature and data will
be written in a non-bucketed manner?
Each receiver will run on 1 core. So if your network is not the bottleneck
then to test the consumption speed of the receivers you can simply do a
*dstream.count.print* to see how many records it can receive. (Also it will
be available in the Streaming tab of the driver UI). If you spawn 10
You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
Thanks
Best Regards
On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote:
Thanks very much, Akhil.
That solved my problem.
Best,
Rex
On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das
What's in your executor's (that .tgz file) conf/spark-defaults.conf file?
Thanks
Best Regards
On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden gog...@gmail.com wrote:
I'm loading these settings from a properties file:
spark.executor.memory=256M
spark.cores.max=1
spark.shuffle.consolidateFiles=true
I found if I move the partitioned columns in schemaString and in Row to
the end of the sequence, then it works correctly...
On 16. juni 2015 11:14, patcharee wrote:
Hi,
I am using spark 1.4 and HiveContext to append data into a partitioned
hive table. I found that the data insert into the
thanks saisai, I should try more times. I thought it would be calculated
automatically as the default.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Saisai Shao sai.sai.s...@gmail.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Hi,
In SparkR shell, I invoke:
mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json",
header=false)
I have tried various filetypes (csv, txt), all fail.
RESPONSE: ERROR RBackendHandler: load on 1 failed
BELOW THE WHOLE RESPONSE:
15/06/16 08:09:13 INFO MemoryStore:
Hi,
I am using spark 1.4 and HiveContext to append data into a partitioned
hive table. I found that the data inserted into the table is correct, but
the partition (folder) created is totally wrong.
Below is my code snippet
hi all!
is there a way to connect cassandra with jdbcRDD ?
On 15 Jun 2015, at 15:43, Borja Garrido Bear
kazebo...@gmail.com wrote:
I tried running the job in a standalone cluster and I'm getting this:
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException:
Hi Al M,
You should try providing more main memory to the shuffle process and it might
reduce the spill to disk. The default configuration for shuffle memory fraction
is 20% of the safe memory, which means 16% of the overall heap memory. So when
we set executor memory, only a small fraction of it is used in
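The relevant Spark 1.x knobs, with illustrative values rather than a
recommendation (raising the shuffle fraction takes memory away from caching):

# conf/spark-defaults.conf
spark.shuffle.memoryFraction   0.4
spark.storage.memoryFraction   0.4

The default math the 16% figure comes from: 0.2 (memoryFraction) x 0.8 (the
internal safety fraction) = 16% of the heap.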
Which version of Spark are you using?
On Tue, Jun 16, 2015 at 6:20 AM, afarahat ayman.fara...@yahoo.com wrote:
Hello;
I have a data set of about 80 million users and 12,000 items (very sparse).
I can get the training part working, no problem (the model has 20 factors).
However, when I try