Hi
How do I ensure, in Spark Streaming 1.3 with Kafka, that when an application is
killed the last running batch is fully processed and the offsets are written to
the checkpointing dir?
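What I have so far is roughly the skeleton below (a sketch only, using the Scala API; the app name, checkpoint path, and batch interval are placeholders, the Kafka stream itself is elided, and it assumes the kill arrives as a normal SIGTERM so the shutdown hook runs):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownSketch {
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("graceful-shutdown-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints/app")
    // build the Kafka direct stream and output operations here
    ssc
  }

  def main(args: Array[String]): Unit = {
    // recover from the checkpoint dir if it exists, otherwise create a new context
    val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints/app", createContext _)
    sys.addShutdownHook {
      // stopGracefully = true lets queued/ongoing batches complete before exit
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}

Is a shutdown hook like this the right way to guarantee the last batch is drained?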
On Fri, Aug 7, 2015 at 8:56 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Hi
I am using Spark Streaming 1.3 and using
Hi,
Using Spark 1.4.0 in standalone mode, with the following configuration:
SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.appDataTtl=86400
The cleanup interval is set to the default.
Application files are not deleted.
Using JavaSparkContext, and when the application ends it
Hi
When I tried to stop Spark Streaming using ssc.stop(false, true), it gives the
following error:
ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
15/08/07 13:41:11 WARN ReceiverSupervisorImpl: Stopped executor without
error
15/08/07 13:41:20 WARN WriteAheadLogManager :
Are you setting SPARK_PREPEND_CLASSES? Try disabling it. With that set, your uber
jar, which does not have SparkConf, is put first on the classpath, which is
messing things up.
Thanks
Best Regards
On Thu, Aug 6, 2015 at 5:48 PM, Stephen Boesch java...@gmail.com wrote:
Given the following
Did you try this way?
export HIVE_SERVER2_THRIFT_PORT=6066
./sbin/start-thriftserver.sh --master master-uri
export HIVE_SERVER2_THRIFT_PORT=6067
./sbin/start-thriftserver.sh --master master-uri
You just have to change HIVE_SERVER2_THRIFT_PORT to instantiate multiple
servers, I think.
Thanks
Just make sure your Hadoop instances are functioning properly (check for
ResourceManager, NodeManager). How are you submitting the job? If it is
getting submitted, then you can look further in the YARN logs to see what's
really going on.
Thanks
Best Regards
On Thu, Aug 6, 2015 at 6:59 PM, Clint
Which version of Spark are you using? It looks like you are hitting the
file-handle limit; in that case you might want to increase the ulimit. You can
validate this by looking in the worker logs (which would probably show a
"Too many open files" exception).
Thanks
Best Regards
On Thu, Aug 6, 2015 at
Can you look more closely in the worker logs and see what's going on? It looks
like a memory issue (GC overhead, etc.; you need to check the worker logs).
Thanks
Best Regards
On Fri, Aug 7, 2015 at 3:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Re-attaching the images.
On Thu, Aug 6, 2015
Hi there,
I have a problem with a Spark Streaming job running on Spark 1.4.1 that
appends to a Parquet table.
My job receives JSON strings and creates a JsonRdd out of them. The JSONs might
come in different shapes, as most of the fields are optional, but they never
have conflicting schemas.
Next, for
Hi,
I'd like to start a service with each Spark executor upon initialization and
have the distributed code reference that service locally.
What I'm trying to do is to invoke torch7 computations without reloading
the model for each row, by starting a Lua HTTP handler that will receive HTTP
requests for
Did you try this way?
/usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf
spark.mesos.executor.docker.image=docker.repo/spark:latest --class
org.apache.spark.examples.SparkPi --jars
hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100
Thanks
Best
I'm not sure what you are up to, but if you can explain what you are trying
to achieve then maybe we can restructure your code. On a quick glance I
could see:
tweetsRDD.collect().map(tweet => DBQuery.saveTweets(tweet))
which will bring the whole data into your driver machine, and it would
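The kind of restructuring I have in mind is roughly this (a sketch only; it assumes DBQuery.saveTweets is your own helper and is safe to call from the executors):

tweetsRDD.foreachPartition { partition =>
  // runs on the executors, so nothing is collected back to the driver
  partition.foreach(tweet => DBQuery.saveTweets(tweet))
}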
It depends on which operation you are doing. If you are doing a .count() on a
Parquet file it might not download the entire file, I think, but if you do a
.count() on a normal text file it might pull the entire file.
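For instance, the two cases I mean (a sketch, assuming the Spark 1.4 APIs; the paths are placeholders):

// count on a Parquet-backed DataFrame (columnar; may not need to pull the whole file)
val parquetCount = sqlContext.read.parquet("s3n://bucket/table").count()

// count on a plain text file (every line has to be read)
val textCount = sc.textFile("s3n://bucket/data.txt").count()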
Thanks
Best Regards
On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya aara...@gmail.com
Why not give it a shot? Spark generally outperforms old MapReduce jobs.
Thanks
Best Regards
On Sat, Aug 8, 2015 at 8:25 AM, linlma lin...@gmail.com wrote:
I have tens of millions of records, each a customer ID and city ID pair.
There are tens of millions of unique customer IDs, and only a few
It seems it is not able to pick up the debug parameters. You can set
export
_JAVA_OPTIONS=-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=y
and then submit the job to enable debugging.
Thanks
Best Regards
On Fri, Aug 7, 2015 at 10:20 PM, Benjamin Ross
Hi,
I might have a stupid question about Spark SQL's implementation of joins on
non-equality conditions, for instance condition1 OR condition2.
In fact, Hive doesn't support such joins, as it is very difficult to express
such conditions as a map/reduce job. However, Spark SQL supports such
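To be concrete, the kind of join I mean is something like this (a sketch with made-up DataFrame and column names):

// join on a disjunction of conditions rather than a single equality
val joined = customers.join(orders,
  customers("id") === orders("customerId") || customers("legacyId") === orders("customerId"))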
You can create your own data schema (a StructType in Spark) and use the
following method to create a data frame with your own schema:
sqlContext.createDataFrame(yourRDD, structType);
I wrote a post on how to do it. You can also get the sample code there:
Light-Weight Self-Service Data Query
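In short, it looks roughly like this (a minimal sketch; the field names and types here are just placeholders, not the ones from the post):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// define the schema explicitly instead of inferring it
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// an RDD of Rows matching that schema
val rowRDD = sc.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))

val df = sqlContext.createDataFrame(rowRDD, schema)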
Also, Spark on Mesos supports cluster mode:
http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler
In a streaming application, how many times is the map transformation object
created?
Say I have
directKafkaStream.repartition(numPartitions).mapPartitions(
    new FlatMapFunction_derivedclass(configs));

class FlatMapFunction_derivedclass {
    FlatMapFunction_derivedclass(Config config) {
    }
    @Override
Can you check if there is a protobuf version other than 2.5.0 on the
classpath?
Please show the complete stack trace.
Cheers
On Sun, Aug 9, 2015 at 9:41 AM, longda...@163.com longda...@163.com wrote:
Hi all,
I compiled spark-1.3.1 on Linux using IntelliJ 14 and got an "assertion failed"
error. The stack trace is below:
Error:scalac:
while compiling:
/home/xiaoju/data/spark-1.3.1/core/src/main/scala/org/apache/spark/SparkContext.scala
during phase: typer
library version: version 2.10.4
compiler version: version 2.10.4
reconstructed args: -nobootcp
Hi Akshat,
Is there a particular reason you don't use s3a? From my experience, s3a performs
much better than the rest. I believe the inefficiency is from the
implementation of the s3 interface.
Best Regards,
Jerry
Sent from my iPhone
On 9 Aug, 2015, at 5:48 am, Akhil Das
Besides finding a fix for this problem, I think I can work around at least the
WARNING message by overriding the Parquet variable:
parquet.enable.summary-metadata
which, according to this PARQUET-107
https://issues.apache.org/jira/browse/PARQUET-107 ticket, can be used to
disable writing the summary file, which is
Hi all,
I compiled spark-1.3.1 on Linux using IntelliJ 14 and got the error "assertion
failed: com.google.protobuf.InvalidProtocolBufferException". How could I
solve the problem?
Hi Bo, I know how to create a DataFrame; my question is how to create a
DataFrame for binary files, and in your blog it is raw text JSON files.
Please read my question properly, thanks.
On Sun, Aug 9, 2015 at 11:21 PM, bo yang bobyan...@gmail.com wrote:
You can create your own data schema
Thank you for the reply.
I use sbt to compile Spark, but there are both protobuf 2.4.1 and 2.5.0 in the
Maven repository, and protobuf 2.5.0 in the .ivy repository.
The stack trace is below:
Error:scalac:
while compiling:
Did you try this way?
/usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf
spark.mesos.executor.docker.image=docker.repo/spark:latest --class
org.apache.spark.examples.SparkPi --jars
hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100
I did, and got
Well, my post uses a raw text JSON file to show how to create a data frame with
a custom data schema. The key idea is to show the flexibility to deal with
any format of data by using your own schema. Sorry if I did not make that
fully clear.
Anyway, let us know once you figure out your problem.
The conflicting metadata values warning is a known issue
https://issues.apache.org/jira/browse/PARQUET-194
The option parquet.enable.summary-metadata is a Hadoop option rather
than a Spark option, so you need to either add it to your Hadoop
configuration file(s) or add it via
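For instance, one way to apply it (a sketch; it assumes sc is your SparkContext and that you want the summary files turned off) is to set it programmatically on the Hadoop configuration the job uses:

// parquet.enable.summary-metadata is read from the Hadoop configuration,
// so set it there before writing any Parquet output
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")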
I used to maintain Luigi at Spotify, and got some insight into workflow-manager
characteristics and production behaviour in the process.
I am evaluating options for my current employer, and the short list is
basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
latter is not
I'm trying to write a simple job for PySpark 1.4 migrating data from MySQL
to Cassandra. I can work with either the MySQL JDBC jar or the Cassandra
jar separately without issue, but when I try to reference both of them it
throws an exception:
Py4JJavaError: An error occurred while calling
I'm trying to set up IPython Notebook on an edge node with port forwarding so
I can run pyspark from my laptop's browser. I've mostly been following the
Cloudera guide here:
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
I got this working on one cluster
I tried running MLlib k-means with the 20newsgroups data set from sklearn. On a
5000-document data set I get one cluster with most of the documents, while the
other clusters have just a handful of documents.
#code
newsgroups_train =
fetch_20newsgroups(subset='train',random_state=1,remove=('headers',
Starting is easy: just use a lazy val. Stopping is harder; I do not think
executors have a cleanup hook currently...
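Something along these lines (a sketch only; the ServerSocket stands in for the real Lua/torch7 HTTP handler, and sc is your SparkContext):

import java.net.ServerSocket

// one instance per executor JVM; the lazy val body runs only the first
// time it is referenced on that executor
object LocalService {
  lazy val port: Int = {
    val server = new ServerSocket(0)  // placeholder for starting the real service
    server.getLocalPort
  }
}

val results = sc.parallelize(1 to 100).mapPartitions { iter =>
  val servicePort = LocalService.port  // first touch on an executor starts the service
  iter.map(record => (record, servicePort))
}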
On Sun, Aug 9, 2015 at 5:29 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'd like to start a service with each Spark Executor upon initialization
and have the
Hi Akshat,
I found an open source library which implements an S3 InputFormat for Hadoop.
Then I use Spark's newAPIHadoopRDD to load data via that S3 InputFormat.
The open source library is https://github.com/ATLANTBH/emr-s3-io. It is a
little old; I looked inside it and made some changes. Then it
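The wiring is roughly like this (a sketch of the newAPIHadoopRDD pattern only; TextInputFormat is used here as a stand-in for the library's S3 InputFormat, and the bucket path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "s3n://my-bucket/path")

// substitute the emr-s3-io InputFormat (and its key/value classes) here
val rdd = sc.newAPIHadoopRDD(hadoopConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
rdd.map(_._2.toString).take(5).foreach(println)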