Is there any documentation that explains how to query JSON documents using
SparkSQL?
Thanks,
Fatma
FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098
On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you
Hi guys,
I am trying to get a better understanding of the DAG generation for a job in
Spark.
Ideally, what I want is to run some SQL query and extract the DAG generated by
Spark. By DAG I mean the stages, the dependencies among stages, and the number
of tasks in every stage.
Could you guys
This was brought up again in
https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item
which was asked about the reliability of zipping RDDs. Basically, it
should be reliable, and if it is not, then it should be reported as a bug.
This general approach should work (with explicit
For those still interested, I raised this issue on JIRA and received an
official response:
https://issues.apache.org/jira/browse/SPARK-6340
Dang I can't seem to find the JIRA now but I am sure we had a discussion
with Matei about this and the conclusion was that RDD order is not
guaranteed unless a sort is involved.
On Mar 17, 2015 12:14 AM, Joseph Bradley jos...@databricks.com wrote:
This was brought up again in
Hi Folks,
I have a situation where I am getting a version conflict between Java
libraries used by my application and ones used by Spark.
Following are the details -
I use the Spark provided by Cloudera, running on a CDH 5.3.2 cluster (Spark
1.2.0-cdh5.3.2). The library that is causing the
I am seeing extremely slow performance from Spark 1.2.1 (MAPR4) on Hadoop
2.5.1 (YARN) on hive external tables on s3n. I am running a 'select
count(*) from s3_table' query on the nodes using Hive 0.13 and Spark SQL
1.2.1.
I am running a 5-node MapR 4.0.2 M3 cluster on EC2 c3.2xlarge instances.
The
I dumped the trees in the random forest model, and occasionally saw a leaf
node with strange stats:
- pred=1.00 prob=0.80 imp=-1.00
Hi,
When you submit a jar to the Spark cluster, it is very difficult to see the
logging. Is there any way to save the logging to a file? I mean only the
logging I created, not Spark's own log output.
Thanks,
David
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you only use operations like map which
preserve partitioning, ordering should be guaranteed from what I know.
On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:
Dang
It was not really meant to be public and overridden, because anything you
want to do to generate jobs from RDDs can be done using DStream.foreachRDD
On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I am trying to create a simple subclass of DStream. If I
See this thread: http://search-hadoop.com/m/JW1q5Kk8Zs1
You can find Spark built against multiple hadoop releases in:
http://people.apache.org/~pwendell/spark-1.3.0-rc3/
FYI
On Mon, Mar 16, 2015 at 11:36 AM, Shuai Zheng szheng.c...@gmail.com wrote:
And it is a NoSuchMethodError, not a
Hi Shuai,
yup, that is exactly what I meant -- implement your own class
MyGroupingRDD. This is definitely more detail than a lot of users will
need to go into, but it's also not all that scary either. In this case, you want
something that is *extremely* close to the existing CoalescedRDD, so start
by
Hi Ralph,
It seems like the https://issues.apache.org/jira/browse/SPARK-6299 issue, which
I'm working on.
I submitted a PR for it; would you test it?
Regards,
Kevin
On Tue, Mar 17, 2015 at 1:11 AM Ralph Bergmann ra...@dasralph.de wrote:
Hi,
I want to try the JavaSparkPi example[1] on a
kevindahl wrote
I'm trying to create a spark data frame from a pandas data frame, but for
even the most trivial of datasets I get an error along the lines of this:
---
Py4JJavaError Traceback
Hi Xi Shen,
You could set spark.executor.memory in the code itself: new
SparkConf().set("spark.executor.memory", "2g")
Or you can try passing the executor memory setting (e.g. --executor-memory 2g) while submitting the jar.
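A minimal sketch of both options (the app and jar names here are hypothetical, and this assumes a plain spark-submit launch):

import org.apache.spark.{SparkConf, SparkContext}

object MemoryExample {
  def main(args: Array[String]): Unit = {
    // Option 1: set executor memory in code, before the SparkContext is created.
    val conf = new SparkConf()
      .setAppName("MemoryExample")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}

// Option 2: keep the code unchanged and pass the setting at submit time, e.g.
//   spark-submit --executor-memory 2g --class MemoryExample memory-example.jar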
Regards
Jishnu Prathap
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, March 16,
Hi,
Trying to run spark ( 1.2.1 built for hdp 2.2) against a yarn cluster
results in the AM failing to start with following error on stderr:
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher
An application id was assigned to the job, but there were no logs.
Hi Sean,
My system is Windows 64-bit. I looked into the resource manager: Java is
the only process that used about 13% CPU resources; there is no disk activity related
to Java; only about 6GB of memory is used out of 56GB in total.
My system responds very well. I don't think it is a system issue.
Thanks,
David
Hi David,
You can try the local-cluster.
the numbers in local-cluster[2,2,1024] mean 2 workers, 2
cores per worker, and 1024 MB of memory per worker
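A small sketch of how that master URL is used (assuming a test that just needs a SparkContext; the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[2,2,1024]: 2 workers, 2 cores per worker, 1024 MB per worker.
val conf = new SparkConf()
  .setAppName("LocalClusterTest")
  .setMaster("local-cluster[2,2,1024]")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100, 4).sum())  // runs across the two local workers
sc.stop()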
Best Regards
Peng Xu
2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com:
Hi,
In YARN mode you can specify the number of executors. I wonder if we can
I can access the management webpage at port 8080 from my Mac and it told me
that the master and 1 slave are running and that I can access them at port 7077
But the port scanner shows that port 8080 is open but not port 7077. I
started the port scanner on the same machine where Spark is running.
Ralph
Am
Hi,
I tried to insert into a hive partitioned table
val ZONE: Int = Integer.valueOf(args(2))
val MONTH: Int = Integer.valueOf(args(3))
val YEAR: Int = Integer.valueOf(args(4))
val weightedUVToDF = weightedUVToRecord.toDF()
weightedUVToDF.registerTempTable("speeddata")
hiveContext.sql("INSERT
MLlib supports streaming linear models:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
and k-means:
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
With an iteration parameter of 1, this amounts to mini-batch SGD where the
mini-batch is
Not quite sure whether I understand your question properly. But if you
just want to read the partition columns, it’s pretty easy. Take the
“year” column as an example, you may do this in HiveQL:
hiveContext.sql("SELECT year FROM speed")
or in DataFrame DSL:
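A rough sketch of such a DSL query (assuming Spark 1.3's table/select API and that the table is named speed):

hiveContext.table("speed").select("year").show()  // reads only the "year" partition column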
I would like to insert into the table, and the value of the partition column
to be inserted must come from a temporarily registered table/DataFrame.
Patcharee
On 16. mars 2015 15:26, Cheng Lian wrote:
Not quite sure whether I understand your question properly. But if you
just want to read the
Hi,
Thanks for the response. I understand that part. But I am asking why the
internal implementation uses a subclass when it could use an existing API.
Unless there is a real difference, it feels like a code smell to me.
Regards,
Madhukara Phatak
http://datamantra.io/
On Mon, Mar 16, 2015 at 2:14
1. I don't think textFile is capable of unpacking a .gz file. You need to use
hadoopFile or newAPIHadoopFile for this.
Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is
compute splits on gz files, so if you have a single file, you'll have a single
partition.
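A small sketch of that behaviour and the usual workaround (the path is made up):

val lines = sc.textFile("hdfs:///data/logs.gz")            // gzip is readable, but arrives as 1 partition
val spread = lines.repartition(sc.defaultParallelism * 3)  // redistribute before any heavy work
println(spread.partitions.length)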
You can track the issue here.
https://issues.apache.org/jira/browse/SPARK-1442
It's currently not supported; I guess the test cases are a work in progress.
On Mon, Mar 16, 2015 at 12:44 PM, hseagle hsxup...@gmail.com wrote:
Hi all,
I'm wondering whether the latest spark-1.3.0 supports
Hi all.
When I specify the number of partitions and save this RDD in Parquet
format, my app fails. For example
selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput")
However, it works well if I store data in text
selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput")
My
I have a spark app which is composed of multiple files.
When I launch Spark using:
../hadoop/spark-install/bin/spark-submit main.py --py-files
/home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py
--master spark://spark-m:7077
I am getting an error:
Hi,
In YARN mode you can specify the number of executors. I wonder if we can
also start multiple executors at local, just to make the test run faster.
Thanks,
David
There are 2 cases for No space left on device:
1. Some tasks which use large temp space cannot run in any node.
2. The free space of datanodes is not balanced. Some tasks which use large
temp space cannot run on several nodes, but they can run on other nodes
successfully.
Because most of our
Thanks a lot. I will give it a try!
On Monday, March 16, 2015, Adam Lewandowski adam.lewandow...@gmail.com
wrote:
Prior to 1.3.0, Spark has 'spark.files.userClassPathFirst' for non-yarn
apps. For 1.3.0, use 'spark.executor.userClassPathFirst'.
See
There are a number of small misunderstandings here.
In the first instance, the executor memory is not actually being set
to 2g and the default of 512m is being used. If you are writing code
to launch an app, then you are trying to duplicate what spark-submit
does, and you don't use spark-submit.
Hi Judy,
In the case of HadoopRDD and NewHadoopRDD, the partition number is
actually decided by the InputFormat used. And
spark.sql.inMemoryColumnarStorage.batchSize is not related to
partition number; it controls the in-memory columnar batch size within a
single partition.
Also, what
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353
On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
We're facing No space left on device errors lately from time to time.
The job will fail after retries. Obviously, in such cases, retrying won't be
Thanks Sean, I forgot it
The output error is the following:
java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to
org.apache.spark.sql.catalyst.types.decimal.Decimal
at
org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at
I think you'd have to say more about "stopped working". Is the GC
thrashing? Does the UI respond? Is the CPU busy or not?
On Mon, Mar 16, 2015 at 4:25 AM, Xi Shen davidshe...@gmail.com wrote:
Hi,
I am running k-means using Spark in local mode. My data set is about 30k
records, and I set the k =
Are you sure the master / slaves started?
Do you have network connectivity between the two?
Do you have multiple interfaces maybe?
Does debian resolve correctly and as you expect to the right host/interface?
On Mon, Mar 16, 2015 at 8:14 AM, Ralph Bergmann ra...@dasralph.de wrote:
Hi,
I try my
I wanted to ask a basic question about the types of algorithms that are
possible to apply to a DStream with Spark streaming. With Spark it is possible
to perform iterative computations on RDDs like in the gradient descent example
val points = spark.textFile(...).map(parsePoint).cache()
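For context, a self-contained sketch of that kind of iterative computation on a cached RDD (1-dimensional, with made-up parsePoint/ITERATIONS definitions, not the exact example from the post):

case class Point(x: Double, y: Double)
def parsePoint(line: String): Point = {
  val Array(x, y) = line.split(',').map(_.toDouble)
  Point(x, y)
}
val ITERATIONS = 10
val points = sc.textFile("hdfs:///data/points.csv").map(parsePoint).cache()
var w = 0.0  // current weight
for (_ <- 1 to ITERATIONS) {
  // full-batch logistic-regression gradient, recomputed from the cached RDD each iteration
  val gradient = points
    .map(p => (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
    .reduce(_ + _)
  w -= gradient
}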
Hi Shuai,
It should certainly be possible to do it that way, but I would recommend
against it. If you look at HadoopRDD, it's doing all sorts of little
book-keeping that you would most likely want to mimic, e.g. tracking the
number of bytes and records that are read, setting up all the Hadoop
Hi all,
I am trying to use the new ALS implementation under
org.apache.spark.ml.recommendation.ALS.
The new method to invoke for training seems to be override def fit(dataset:
DataFrame, paramMap: ParamMap): ALSModel.
How do I create a dataframe object from ratings data set that is on hdfs ?
Hello Ted,
Yes, I can understand what you are suggesting. But I am unable to figure out
where I am going wrong; could you please point out which locations
I should look at to find and correct the mistake?
I greatly appreciate your help!
On Sun, Mar 15, 2015 at 1:10 PM, Ted Yu
Hi,
I want to try the JavaSparkPi example[1] on a remote Spark server but I
get a ClassNotFoundException.
When I run it locally it works, but not remotely.
I added the spark-core lib as dependency. Do I need more?
Any ideas?
Thanks Ralph
[1] ...
Hey Masf,
I’ve created SPARK-6360
https://issues.apache.org/jira/browse/SPARK-6360 to track this issue.
Detailed analysis is provided there. The TL;DR is, for Spark 1.1 and
1.2, if a SchemaRDD contains decimal or UDT column(s), after applying
any traditional RDD transformations (e.g.
Hi All,
I am running Spark 1.2.1 and the AWS SDK. To make sure the AWS SDK is compatible with
httpclient 4.2 (which I assume Spark uses?), I have already downgraded it to
version 1.9.0
But even that, I still got an error:
Exception in thread main java.lang.NoSuchMethodError:
Okay, I think I found the mistake.
The Eclipse Maven plugin suggested version 1.2.1 of the spark-core lib,
but I use Spark 1.3.0.
After fixing that I can access the Spark server.
Ralph
Am 16.03.15 um 14:39 schrieb Ralph Bergmann:
I can access the manage webpage at port 8080 from my mac and it told
Hi Shuai,
On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng szheng.c...@gmail.com wrote:
Sorry I responded late.
Zhan Zhang's solution is very interesting and I looked into it, but it is
not what I want. Basically I want to run the job sequentially and also gain
parallelism. So if possible, if
Hi
Currently all the jobs in Spark get submitted using a queue. I have a
requirement where a submitted job will generate another set of jobs with some
priority, which should again be submitted to the Spark cluster based on
priority. That means a job with higher priority should be executed first. Is
it
Hi Todd,
Thanks for the help. I'll try again after building a distribution with the
1.3 sources. However, I wanted to confirm what I mentioned earlier: is it
sufficient to copy the distribution only to the client host from where
spark-submit is invoked (with spark.yarn.jar set), or is there a
Thanks Shixiong!
Very strange that our tasks were retried on the same executor again and
again. I'll check spark.scheduler.executorTaskBlacklistTime.
Jianshi
On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu zsxw...@gmail.com wrote:
There are 2 cases for No space left on device:
1. Some tasks
I see. Since all Spark SQL queries must be issued from the driver side,
you'll have to first collect all the values of interest to the driver side,
and then use them to compose one or more insert statements.
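A rough sketch of that idea (the table and column names here are hypothetical, loosely based on the thread):

// Collect the distinct partition values to the driver, then issue one INSERT per value.
val zones = hiveContext.sql("SELECT DISTINCT zone FROM speeddata").collect().map(_.getInt(0))
zones.foreach { z =>
  hiveContext.sql(
    s"INSERT INTO TABLE speed PARTITION (zone = $z) " +
    s"SELECT uv, weight FROM speeddata WHERE zone = $z")
}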
Cheng
On 3/16/15 10:33 PM, patcharee wrote:
I would like to insert the table, and the
Hi Bharath,
I ran into the same issue a few days ago, here is a link to a post on
Horton's fourm. http://hortonworks.com/community/forums/search/spark+1.2.1/
In case anyone else needs to do this, these are the steps I took to get
it to work with Spark 1.2.1 as well as Spark 1.3.0-RC3:
1.
You probably want to update this line as follows:
lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)
For more details on why, see this answer
http://stackoverflow.com/a/27631722/877069.
Nick
On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:
1. I
Oh, by default it's set to 0L.
I'll try setting it to 3 immediately. Thanks for the help!
Jianshi
On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Thanks Shixiong!
Very strange that our tasks were retried on the same executor again and
again. I'll check
Try increasing the driver memory. We store trees on the driver node.
If maxDepth=20 and numTrees=50, you may need a large driver memory to
store all tree models. You might want to start with a smaller maxDepth
and then increase it and see whether deep trees really help (vs. the
cost). -Xiangrui
https://issues.apache.org/jira/browse/SPARK-5954 is for this issue and
Shuo is working on it. We will first implement topByKey for RDD and
then we could add it to DataFrames. -Xiangrui
On Mon, Mar 9, 2015 at 9:43 PM, Moss rhoud...@gmail.com wrote:
I do have a schemaRDD where I want to group by
It's mostly for legacy reasons. First we added MappedDStream,
etc., and then later we realized we needed to expose something more
generic for arbitrary RDD-to-RDD transformations. It can be easily replaced.
However, there is a slight value in having MappedDStream, for developers to
Hi, everyone,
I'm wondering whether there is a possibility to setup an official IRC
channel on freenode.
I noticed that a lot of Apache projects have such a channel to let
people talk directly.
Best
Michael
Hi,
I'm very new to Spark and GraphX. I downloaded and configured Spark on a
cluster, which uses Hadoop 1.x. The master UI shows all workers. The
example command run-example SparkPi works fine and completes
successfully.
I'm interested in GraphX. Although the documentation says it is built-in
Hi Akhil,
Yes, I did change both versions on the project and the cluster. Any clues?
Even the sample code from the Spark website failed to work.
Thanks,
Eason
On Sun, Mar 15, 2015 at 11:56 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did you change both the versions? The one in your build
From my local maven repo:
$ jar tvf
~/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar
| grep SchemeRegistry
1373 Fri Apr 19 18:19:36 PDT 2013
org/apache/http/impl/conn/SchemeRegistryFactory.class
2954 Fri Apr 19 18:19:36 PDT 2013
Hi All,
I wrote out a complex parquet file from spark sql and now I am trying to put
a hive table on top. I am running into issues with creating the hive table
itself. Here is the json that I wrote out to parquet using spark sql:
textFileStream and the default fileStream recognize the compressed
XML (.xml.gz) files.
Each line in the XML file is an element in RDD[String].
Then the whole RDD is converted to proper XML-format data and stored in a *Scala
variable*.
- I believe storing huge data in a *Scala variable* is
Hi,
Maybe this is what you are looking for :
http://spark.apache.org/docs/1.2.0/job-scheduling.html#fair-scheduler-pools
Thanks,
On Mon, Mar 16, 2015 at 8:15 PM, abhi abhishek...@gmail.com wrote:
Hi
Currently all the jobs in Spark get submitted using a queue. I have a
requirement where
I don't see non-serializable objects in the provided snippets. But you
can always add -Dsun.io.serialization.extendedDebugInfo=true to Java
options to debug serialization errors.
Cheng
On 3/17/15 12:43 PM, anu wrote:
Spark Version - 1.1.0
Scala - 2.10.4
I have loaded following type data
If I understand correctly, the above document creates pools for priority,
which are static in nature and have to be defined before submitting the job.
In my scenario each generated task can have a different priority.
Thanks,
Abhi
On Mon, Mar 16, 2015 at 9:48 PM, twinkle sachdeva
http://apache-spark-developers-list.1001551.n3.nabble.com/Job-priority-td10076.html#a10079
On Mon, Mar 16, 2015 at 10:26 PM, abhi abhishek...@gmail.com wrote:
If I understand correctly, the above document creates pools for priority,
which are static in nature and have to be defined before
Yes.
Each generated job can have a different priority. It is like a recursive
function, where in each iteration the generated job will be submitted to the
Spark cluster based on its priority. Jobs with lower priority, or below
some threshold, will be discarded.
Thanks,
Abhi
On Mon, Mar 16, 2015
Spark Version - 1.1.0
Scala - 2.10.4
I have loaded following type data from a parquet file, stored in a schemaRDD
[7654321,2015-01-01 00:00:00.007,0.49,THU]
Since, in Spark version 1.1.0, the Parquet format doesn't support saving
timestamp values, I have saved the timestamp data as a string. Can you
Still no luck running purpose-built 1.3 against HDP 2.2 after following all
the instructions. Anyone else faced this issue?
On Mon, Mar 16, 2015 at 8:53 PM, Bharath Ravi Kumar reachb...@gmail.com
wrote:
Hi Todd,
Thanks for the help. I'll try again after building a distribution with the
1.3
Each RDD has multiple partitions, and each of them will produce one HDFS file when
saving output. I don't think you are allowed to have multiple file handlers
writing to the same HDFS file. You can still load multiple files into Hive
tables, right?
Thanks..
Zhan Zhang
On Mar 15, 2015, at 7:31
Hello,
I am new to spark streaming API.
I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) to
streaming data. Currently I am using foreachRDD to go through the DStream,
and I am generating a model based on each RDD. Am I doing anything
logically wrong here?
Thank you.
Sample
The programming guide has a short example:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Note that once you infer a schema for a JSON dataset, you can also use nested
path notation
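A minimal sketch based on that guide (the file path and the nested address.city field are assumed for illustration):

val people = sqlContext.jsonFile("examples/src/main/resources/people.json") // schema is inferred from the JSON
people.registerTempTable("people")
// Nested fields can then be referenced with dot notation:
sqlContext.sql("SELECT name, address.city FROM people").collect().foreach(println)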
Hi Imran,
I am a bit confused here. Assume I have RDD a with 1000 partitions which has also
been sorted. How can I control, when creating RDD b (with 20 partitions), to make
sure that partitions 1-50 of RDD a map to the 1st partition of RDD b? I don't see any
control code/logic here.
Your code below:
Hi,
I've been trying to run a simple SparkWordCount app on EC2, but it looks
like my apps are not succeeding/completing. I'm suspecting some sort of
communication issue. I used the SparkWordCount app from
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
Tnx for the workaround.
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 480
On 16/03/15 06:20, Jeremy Freeman wrote:
Hi Margus, thanks for reporting this, I’ve been able to reproduce and
there does indeed appear to be a bug. I’ve created a JIRA and have a
fix ready, can
One approach would be: if you are using fileStream, you can access the
individual filenames from the partitions, and with that filename you can
apply your decompression/parsing logic and get it done.
Like:
UnionPartition upp = (UnionPartition)
ds.values().getPartitions()[i];
Did you change both the versions? The one in your build file of your
project and the spark version of your cluster?
Thanks
Best Regards
On Sat, Mar 14, 2015 at 6:47 AM, EH eas...@gmail.com wrote:
Hi all,
I've been using Spark 1.1.0 for a while, and now would like to upgrade to
Spark 1.1.1
Not sure if this will help, but can you try setting the following:
set("spark.core.connection.ack.wait.timeout", "6000")
Thanks
Best Regards
On Sat, Mar 14, 2015 at 4:08 AM, Chen Song chen.song...@gmail.com wrote:
When I ran a Spark SQL query (a simple group-by query) via Hive support, I
have seen
If you want more partitions then you have to specify it as:
rdd.groupByKey(10).mapValues(...)
I think if you don't specify anything, the number of partitions will be the
number of cores that you have for processing.
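Spelled out with a tiny made-up pair RDD:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey(10)          // ask for 10 output partitions
println(grouped.partitions.length)          // 10
val sizes = grouped.mapValues(_.size)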
Thanks
Best Regards
On Sat, Mar 14, 2015 at 12:28 AM, Adrian Mocanu amoc...@verticalscope.com
Hello,
So I actually solved the problem... see point 3.
Here are a few approaches/errors I was getting:
1) mvn package exec:java -Dexec.mainClass=HelloWorld
Error: java.lang.ClassNotFoundException: HelloWorld
2)
Which location exactly should I specify on the classpath?
Thanks,
On Mon, Mar 16, 2015 at 12:52 PM, Cheng, Hao hao.ch...@intel.com wrote:
It doesn't take effect if you just put jar files under the
lib-managed/jars folder; you need to put them on the classpath explicitly.
*From:*
Dibyendu,
Thanks for the reply.
I am reading your project homepage now.
One quick question I care about is:
If the receivers fail for some reason (for example, they are killed forcibly by
someone else), is there any mechanism for the receivers to fail over
automatically?
On Mon, Mar 16, 2015 at 3:25
Hi All,
Processing streaming JSON files with Spark features (Spark Streaming and
Spark SQL) is very efficient and works like a charm.
Below is the code snippet to process JSON files.
windowDStream.foreachRDD(IncomingFiles => {
val IncomingFilesTable =
Hi Fightfate,
I have attached my hive-site.xml file in the previous mail. Please check the
configuration once. In Hive I am able to create tables and also able to
load data into a Hive table.
Please find the attached file.
Regards,
Sandeep.v
On Mon, Mar 16, 2015 at 11:34 AM, fightf...@163.com
Hi,
I am trying to create a simple subclass of DStream. If I understand
correctly, I should override *compute* for lazy operations and *generateJob*
for actions. But when I try to override generateJob, it gives an error saying the
method is private to the streaming package. Is my approach correct, or am
Or you need to specify the jars either in configuration or
bin/spark-sql --jars mysql-connector-xx.jar
From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Monday, March 16, 2015 2:04 PM
To: sandeep vura; Ted Yu
Cc: user
Subject: Re: Re: Unable to instantiate
I have already added the mysql-connector-xx.jar file in the spark/lib-managed/jars
directory.
Regards,
Sandeep.v
On Mon, Mar 16, 2015 at 11:48 AM, Cheng, Hao hao.ch...@intel.com wrote:
Or you need to specify the jars either in configuration or
bin/spark-sql --jars mysql-connector-xx.jar
Hi,
Internally Spark uses the HDFS API to handle file data. Have a look at HAR
and the SequenceFile input format. More information is on this Cloudera blog:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
Regards,
Madhukara Phatak
http://datamantra.io/
On Sun, Mar 15, 2015 at 9:59 PM, Pat
If you use fileStream, there's an option to filter out files. In your case
you can easily create a filter to remove _temporary files. In that case,
you will have to move your code inside foreachRDD of the DStream, since the
application will become a streaming app.
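A hedged sketch of that kind of filtered fileStream (the directory, input types, and batch interval are assumptions):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///input/dir",
  (path: Path) => !path.toString.contains("_temporary"),  // skip _temporary output
  newFilesOnly = true)
stream.map(_._2.toString).foreachRDD { rdd =>
  println(s"batch size: ${rdd.count()}")  // existing batch logic would go here instead
}
ssc.start()
ssc.awaitTermination()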
Thanks
Best Regards
On Sat, Mar
Guys,
We have a project which builds upon Spark streaming.
We use Kafka as the input stream, and create 5 receivers.
When this application had run for around 90 hours, all 5 receivers failed
for some unknown reason.
In my understanding, it is not guaranteed that a Spark Streaming receiver
will
Hi all,
I'm wondering whether the latest spark-1.3.0 supports the windowing and
analytic functions in Hive, such as row_number, rank, etc.
Indeed, I've done some testing using spark-shell and found that
row_number is not supported yet.
But I still found that there were
Hi,
I have set spark.executor.memory to 2048m, and in the UI Environment
page, I can see this value has been set correctly. But in the Executors
page, I saw there's only 1 executor and its memory is 265.4 MB. A very strange
value. Why not 256 MB, or just what I set?
What am I missing here?
How are you setting it? and how are you submitting the job?
Thanks
Best Regards
On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote:
Hi,
I have set spark.executor.memory to 2048m, and in the UI Environment
page, I can see this value has been set correctly. But in the
Akhil,
I have checked the logs. There isn't any clue as to why the 5 receivers
failed.
That's why I just take it for granted that receiver failures will be a
common issue, and that we need to figure out a way to detect this kind of
failure and do fail-over.
Thanks
On Mon, Mar 16, 2015 at
How many threads are you allocating while creating the SparkContext? E.g.
local[4] will allocate 4 threads. You can try increasing it to a higher
number, and also try setting the level of parallelism to a higher number.
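For example, a hedged local configuration along those lines (the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KMeansLocal")
  .setMaster("local[8]")                   // 8 local worker threads instead of the default
  .set("spark.default.parallelism", "16")  // more tasks per stage
val sc = new SparkContext(conf)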
Thanks
Best Regards
On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen davidshe...@gmail.com
You need to figure out why the receivers failed in the first place. Look in
your worker logs and see what really happened. When you run a streaming job
continuously for a longer period there will usually be a lot of logs (you can
enable log rotation, etc.), and if you are doing a groupBy, join, etc. type
It doesn't take effect if you just put jar files under the lib-managed/jars
folder; you need to put them on the classpath explicitly.
From: sandeep vura [mailto:sandeepv...@gmail.com]
Sent: Monday, March 16, 2015 2:21 PM
To: Cheng, Hao
Cc: fightf...@163.com; Ted Yu; user
Subject: Re: Re: Unable
Which version of Spark are you running?
You can try this Low Level Consumer :
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This is designed to recover from various failures and has a very good fault-recovery
mechanism built in. It is being used by many users and at
present