I checked the HBase UI. Well, this table is not completely evenly spread across
the nodes, but I think to some extent it can be seen as nearly evenly
spread - at least there is no single node that has too many regions.
Here is a screenshot of the HBase UI
I also had a similar problem while joining a dataset. After digging into the
worker logs I figured out it was throwing a CancelledKeyException; I'm not sure
of the cause.
Thanks
Best Regards
On Tue, Sep 30, 2014 at 5:15 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I have a problem with my application
Hi, Liquan, thanks for the response.
In your example, I think the hash table should be built on the right side, so
Spark can iterate through the left side and find matches in the right side from
the hash table efficiently. Please comment and suggest, thanks again!
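For what it's worth, here is a rough RDD-level sketch of the build/probe idea (an illustration of the concept only, not how Spark SQL actually implements its joins; the data and names are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Illustration only: build a hash table from the (smaller) right side,
// broadcast it, then stream the left side and probe the table.
val sc = new SparkContext(new SparkConf().setAppName("hash-join-sketch").setMaster("local[*]"))

val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))   // streamed side
val right = Seq((1, "x"), (3, "y"))                              // assumed small enough to broadcast

val buildSide = sc.broadcast(right.toMap)                        // the hash table
val joined = left.flatMap { case (k, v) =>
  buildSide.value.get(k).map(r => (k, (v, r)))                   // probe; emit only matches
}
joined.collect().foreach(println)

(A full outer join also has to emit unmatched rows from both sides, which is why a single hash table is not enough for that case.)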
Hi,
I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1], but I'm having some issues with the SQL API, in
particular with what the DataTypes translate to.
1. A SchemaRDD is composed of a Row and StructType - I'm using the latter to decompose a Row into primitives. I'm not clear
Hi All,
I have a problem that I would like to consult about regarding Spark Streaming.
I have a Spark Streaming application that parses a file (which keeps
growing as time passes by). This file contains several columns containing
lines of numbers,
and the parsing is divided into windows (1 minute each). Each
We currently choose not to run jobs on YARN, so I stopped trying this.
Anyway, thanks for your suggestions.
At least your solutions may help people who must run their jobs on YARN : )
Hi Haopu,
How about full outer join? One hash table may not be efficient for this
case.
Liquan
On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote:
Hi, Liquan, thanks for the response.
In your example, I think the hash table should be built on the right
side, so
I am new to Spark. I am trying to implement Spark Streaming from
a Kafka topic.
It worked fine for some time, but some time later it started throwing the
error below. I am not getting any clue as to what is causing the issue.
java.lang.Exception: Could not compute split, block
Hi again,
Just FYI, I found the mistake in my code regarding restartability of Spark
Streaming: I had a method providing a context (either retrieved from a
checkpoint or, if no checkpoint was available, built anew) and was building and
then starting a stream on it.
The mistake is that we should not build
Hi All,
I have a requirement to process a set of files in parallel, so I'm
submitting Spark jobs using Java's ExecutorService. But when I do it this way,
one or more jobs fail with status EXITED. Earlier I tried with a
standalone Spark cluster, setting the job scheduling to Fair Scheduling. I
As a follow up to my own question, I see that the FlumeBatchFetcher
https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeBatchFetcher.scala
acks the batch only after it calls Receiver.store(...).
So my question is: does store()
I would like to know a way to avoid adding those $_folder$ files to S3 as
well. I can go ahead and delete them, but it would be nice if Spark handled
this for you.
Those files are created by the Hadoop API that Spark leverages. Spark does
not directly control that.
You may be able to check with the Hadoop project on whether they are
looking at changing this behavior. I believe it was introduced because S3
at one point required it, though it doesn't anymore.
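If you want to clean them up yourself in the meantime, a minimal sketch using the Hadoop FileSystem API could look like the following (the bucket path is hypothetical, and where the marker keys live depends on your output layout):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Delete the "<dir>_$folder$" marker keys that the S3 native filesystem leaves behind.
val hadoopConf = new Configuration()
val outputDir = new Path("s3n://my-bucket/output")            // hypothetical output path
val fs = FileSystem.get(outputDir.toUri, hadoopConf)

// The markers usually sit next to the directory they describe.
val markers: Array[FileStatus] =
  Option(fs.globStatus(new Path(outputDir.getParent, "*_$folder$"))).getOrElse(Array.empty)
markers.foreach(status => fs.delete(status.getPath, false))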
Any ideas guys?
Trying to find some information online. Not much luck so far.
Hi,
What is the correct Scala code to register an Array of this private Spark
class with Kryo?
java.lang.IllegalArgumentException: Class is not registered:
org.apache.spark.util.collection.CompactBuffer[]
Note: To register this class use:
Not sure if this is what you are after, but it's based on a moving average
within Spark... I was building an ARIMA model on top of Spark and this
helped me out a lot:
http://stackoverflow.com/questions/23402303/apache-spark-moving-average
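One simple way to express a moving average on an RDD (a sketch of the general idea, not necessarily the exact approach in the linked answer; the window size and data are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("moving-average").setMaster("local[*]"))

val series = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val w = 3                                             // window size

// Each point at index i contributes to the w windows ending at i .. i + w - 1.
val contributions = series.zipWithIndex().flatMap { case (v, i) =>
  (i until i + w).map(windowEnd => (windowEnd, v))
}

val movingAvg = contributions
  .groupByKey()
  .filter { case (_, vs) => vs.size == w }            // drop incomplete windows at the edges
  .mapValues(vs => vs.sum / w)
  .sortByKey()

movingAvg.collect().foreach { case (i, avg) => println(s"window ending at $i: $avg") }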
JIMMY MCERLAIN
DATA SCIENTIST (NERD)
I have run into the same issue. I understand that with the new assembly built
with -Phive, I can run a Spark job in yarn-cluster mode. But is there a way
for me to run spark-shell with support for Hive?
I tried to add the new assembly jar with --driver-library-path
and --driver-class-path but neither
Thanks for your response, Burak; it was very helpful.
I am noticing that if I run PCA before KMeans, the KMeans algorithm will
actually take longer to run than if I had just run KMeans without PCA. I was
hoping that by using PCA first it would actually speed up the KMeans
algorithm.
I have
Sorry to ask another basic question.
Could you point out what I should read to set up a pseudo-distributed
Hadoop, Mahout and Spark cluster? Does it really need something like CDH?
I want to access Mahout and Spark output and display it in Play (outside CDH). I
also want to access Spark output from
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows() to
We have installed Spark 1.1 in standalone master mode. When we try to
access a Parquet-format table, we get the error below; one of the fields
is defined as timestamp. Based on the information provided on apache.spark.org,
Spark supports Parquet and pretty much all Hive datatypes including
code snippet in short:
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name
String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'");
I have the same problem! I have to start the same job 3 or 4 times again; it depends
on how big the data and the cluster are. The runtime goes down in the following
jobs, and at the end I get the fetch failure error, at which point I must
restart the Spark shell and everything works well again. And I don't
Spark 1.1 comes with Hive 0.12, and Hive 0.12 doesn't support the timestamp
datatype for the Parquet format.
https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Limitations
Hi,
We have a cluster setup with Spark 1.0.2 running 4 workers and 1 master,
with 64G RAM each. In the SparkContext we specify 32G of executor memory.
However, whenever a task runs longer than approximately 15 minutes, all
the executors are lost, just like some sort of timeout, no matter if the
Thank you, Nilesh. Originally the tables were created in Impala, and we are trying
to access the same table through Spark SQL. Any idea what the best file format is
for running Spark jobs when we have a date field?
Hi,
I'm trying to use Matrix Factorization over a dataset with about 6.5M users,
2.5M products and 120M ratings over products. The test is done in standalone
mode, with a single worker (quad-core and 16 GB RAM).
The program runs out of memory, and I think this happens because
flatMap holds
Hi All, Can someone please point me to the documentation that describes how missing
value imputation is done in MLlib? Also, any information on how this fits into
the overall roadmap would be great.
I've been trying to get the cassandra_inputformat.py and
cassandra_outputformat.py examples running for the past half day. I am
running cassandra21 community from datastax on a single node (in my dev
environment) with spark-1.1.0-bin-hadoop2.4.
I can connect and use cassandra via cqlsh and I can
I think this problem has been fixed after the 1.1 release. Can you try the
master branch?
On Mon, Sep 29, 2014 at 10:06 PM, vdiwakar.malladi
vdiwakar.mall...@gmail.com wrote:
I'm using the latest version i.e. Spark 1.1.0
Thanks.
You may need a cluster with more memory. The current ALS
implementation constructs all subproblems in memory. With rank=10,
that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.5GB. The ratings
need 2GB, not counting the overhead. ALS creates in/out blocks to
optimize the computation, which takes
We don't handle missing value imputation in the current version of
MLlib. In future releases, we may store feature information in the
dataset metadata, which could include the default value to replace missing
values. But no one is committed to working on this feature. For now, you
can filter out examples
Hi
I have a simple streaming app. All I want to do is figure out how many lines
I have received in the current mini batch. If numLines were a JavaRDD I could
simply call count(). How do you do something similar in Streaming?
Here is my pseudo code:
JavaDStream<String> msg =
Hi Andy
I'm new to Spark and have been working with Scala, not Java, but I see
there's a dstream() method to convert from JavaDStream to DStream. Then within
DStream
http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/api/java/org/apache/spark/streaming/dstream/DStream.html
there is a
Hi Gary,
I gave this a shot on a test cluster of CDH4.7 and actually saw a
regression in performance when running the numbers. Have you done any
benchmarking? Below are my numbers:
Experimental method:
1. Write 14GB of data to HDFS via [1]
2. Read data multiple times via [2]
Experiment 1:
Hi,
I am trying to compute the number of unique users from a year's worth of
data. There are about 300 files and each file is quite large (on the order of a GB). I
first tried this without a loop by reading all the files in the directory
using the glob pattern sc.textFile("dir/*"). But the tasks were
Hi Maddenpj,
Right now the best estimate I've heard for the open file limit is that
you'll need the square of the largest partition count in your dataset.
I filed a ticket to log the ulimit value when it's too low at
https://issues.apache.org/jira/browse/SPARK-3750
On Mon, Sep 29, 2014 at 6:20
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
Most likely it is the Hadoop 1 vs Hadoop 2 issue. The example was given for
Hadoop 1 (default Hadoop version for Spark). You may try to set the output
format class in conf for
Hello Folks,
I have been trying to implement a tree reduction algorithm recently in
Spark but could not find suitable parallel operations. Assuming I have a
general tree like the following -
I have to do the following -
1) Do some computation at each leaf node to get an array of doubles. (This
Hi Andrew and Gary,
I've done some experimentation with this and had similar results. I can't
explain the speedup in write performance, but I dug into the read slowdown
and found that enabling short-circuit reads results in Hadoop not doing
read-ahead in the same way. At a high level, when SCR
Thanks, that worked! I downloaded the version pre-built against hadoop1 and
the examples worked.
- David
On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang kzh...@apache.org wrote:
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.JobContext, but class was expected
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the computation, the 'obvious' way is the following (a rough sketch is included after the list):
* broadcast the tree T to each worker (ok since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is
(parent,data_at_node)
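Here is a small self-contained sketch of that level-by-level idea (the tree, the per-leaf computation and the combine function are all made-up placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("tree-reduce-sketch").setMaster("local[*]"))

// A tiny example tree: child id -> parent id, with node 0 as the root.
val parent = Map(1L -> 0L, 2L -> 0L, 3L -> 1L, 4L -> 1L, 5L -> 2L, 6L -> 2L)
val parentBc = sc.broadcast(parent)       // broadcast the (small, in-memory) tree
val depth = 2                             // number of levels below the root

val computeAtLeaf = (id: Long) => Array(id.toDouble, id * 2.0)        // placeholder leaf computation
val combine = (a: Array[Double], b: Array[Double]) =>                 // placeholder sibling combine
  a.zip(b).map { case (x, y) => x + y }

// Start at the deepest level: (parentId, arrayOfDoubles)
var level = sc.parallelize(Seq(3L, 4L, 5L, 6L))
  .map(id => (parentBc.value(id), computeAtLeaf(id)))

// Walk up one level per iteration, combining siblings under the same parent.
for (_ <- 1 until depth) {
  level = level
    .reduceByKey(combine)
    .map { case (id, acc) => (parentBc.value.getOrElse(id, id), acc) }
}

val atRoot = level.reduceByKey(combine).collect()
atRoot.foreach { case (id, arr) => println(s"node $id: ${arr.mkString(", ")}") }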
You can use sc.wholeTextFiles to read a directory of text files.
Also, it seems from your code that you are only interested in the current
year's count, you can perform a filter before distinct() and perform a
reduce to sum up counts.
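A minimal sketch of that combination (the path, the line format and the year filter are assumptions about your data):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("unique-users").setMaster("local[*]"))

// (fileName, fileContent) pairs for every file under the directory
val files = sc.wholeTextFiles("hdfs:///logs/users")              // hypothetical path

val uniqueUsers = files
  .flatMap { case (_, content) => content.split("\n") }
  .filter(line => line.nonEmpty && line.startsWith("2014"))      // filter to the year of interest first
  .distinct()

println("unique users this year: " + uniqueUsers.count())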
Hope this helps!
Liquan
On Tue, Sep 30, 2014 at 1:59 PM, SK
Are these the logs of the worker where the failure occurs? I think issues
similar to these have since been solved in later versions of Spark.
TD
On Tue, Sep 30, 2014 at 11:33 AM, Shaikh Riyaz shaikh@gmail.com wrote:
Dear All,
We are using Spark Streaming version 1.0.0 in our Cloudera Hadoop
Hi,
I am new to Spark. I have written a custom RabbitMQ receiver which calls the
store method of the Receiver interface. I can see that the block is being
stored. I am trying to process each RDD in the DStream using the foreach
function, but I am unable to figure out why this block is not getting invoked
Hi Jon
Thanks, foreachRDD seems to work. I am running on a 4-machine cluster. It
seems like the function executed by foreachRDD is running on my driver. I used
logging to check. This is exactly what I want. I need to write my final
results back to stdout, so RDD.pipe() will work. I do not have any
If the tree is too big, build it on GraphX, but it will need thorough
analysis so that the partitions are well balanced...
On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote:
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the
I have a 3-node EC2 cluster, each node assigned 18G of Spark executor memory. After
I run my Spark batch job, I get two RDDs from different forks, but with the
exact same format. And when I perform a union operation, I get executor
disassociated errors and the whole Spark job fails and quits. Memory shouldn't
Hi,
Is there any guidance on how much total memory is needed, for data of a certain
size, to achieve a relatively good speed?
I have around 200 GB of data, and the current total memory for my 8
machines is around 120 GB. Is that too small to run data of this size? Even the read
Does the union function cause any data shuffling?
19:02:45,963 INFO [org.apache.spark.MapOutputTrackerMaster]
(spark-akka.actor.default-dispatcher-14) Size of output statuses for shuffle
1 is 216 bytes
19:02:45,964 INFO [org.apache.spark.scheduler.DAGScheduler]
(spark-akka.actor.default-dispatcher-14) Got job 5 (getCallSite at null:-1)
with
Hi TD,
Thanks for your reply.
Attachment in previous email was from Master.
Below is the log message from one of the worker.
---
2014-10-01 01:49:22,348 ERROR akka.remote.EndpointWriter: AssociationError
Hi,
By default, 60% of JVM memory is reserved for RDD caching, so in your case,
72GB memory is available for RDDs which means that your total data may fit
in memory. You can check the RDD memory statistics via the storage tab in
web ui.
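For reference, the 60% figure corresponds to spark.storage.memoryFraction, and MEMORY_AND_DISK lets partitions that don't fit spill to disk. A quick sketch (all values and paths are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()                             // master URL supplied via spark-submit
  .setAppName("memory-sizing")
  .set("spark.executor.memory", "15g")                 // per-executor JVM heap
  .set("spark.storage.memoryFraction", "0.6")          // fraction of heap usable for cached RDDs (the default)
val sc = new SparkContext(conf)

val data = sc.textFile("hdfs:///big/dataset")          // hypothetical input
data.persist(StorageLevel.MEMORY_AND_DISK)             // cache what fits in memory, spill the rest to disk
println(data.count())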
Hope this helps!
Liquan
On Tue, Sep 30, 2014 at 4:11 PM,
Only fit the data in memory where you want to run the iterative
algorithm.
For map-reduce operations, it's better not to cache if you have a memory
crunch...
Also schedule the persist and unpersist calls such that you utilize the RAM
well...
On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei
The sample below is essentially word count, modified to be big data by
turning lines into groups of
upper-case letters and then generating all case variants - it is modeled
after some real problems in biology.
The issue is that I know how to do this in Hadoop, but in Spark the use of a List
in my flatMap
Hi,
I'm trying to save about a million lines containing statistics data,
something like:
233815212529_10152316612422530 233815212529_10152316612422530 1328569332
1404691200 1404691200 1402316275 46 0 0 7
0 0 0
Thanks for the research Kay!
It does seem to be addressed, and hopefully fixed, in that ticket conversation and
also in https://issues.apache.org/jira/browse/HDFS-4697. So the best thing
here is to wait to upgrade to a version of Hadoop that has that fix and
then repeat the test, rather than repeating it right now. That will be
It would help to turn on debug level logging in log4j and see the logs.
Just looking at the error logs is not giving me any sense. :(
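One quick way to do that, assuming the log4j 1.x setup that Spark 1.x ships with (either programmatically in the driver, or via conf/log4j.properties):

import org.apache.log4j.{Level, Logger}

// Programmatic route: raise the root logger to DEBUG before the streaming context starts.
Logger.getRootLogger.setLevel(Level.DEBUG)

// File-based route: copy conf/log4j.properties.template to conf/log4j.properties
// and change the first line to: log4j.rootCategory=DEBUG, console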
TD
On Tue, Sep 30, 2014 at 4:30 PM, Shaikh Riyaz shaikh@gmail.com wrote:
Hi TD,
Thanks for your reply.
Attachment in previous email was from Master.
Can you launch a job which exercises TableInputFormat on the same table
without using Spark?
This would show whether the slowdown is in HBase code or somewhere else.
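For example, a rough sketch of a Spark-free baseline using the plain HBase client API (the table name is hypothetical; adjust the Scan to match what your Spark job reads):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Scan}

val hbaseConf = HBaseConfiguration.create()
val table = new HTable(hbaseConf, "my_table")          // hypothetical table name
val scan = new Scan()                                   // full-table scan, like TableInputFormat would do

val start = System.currentTimeMillis()
val scanner = table.getScanner(scan)
var rows = 0L
val it = scanner.iterator()
while (it.hasNext) { it.next(); rows += 1 }
println(s"scanned $rows rows in ${System.currentTimeMillis() - start} ms")

scanner.close()
table.close()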
Cheers
On Mon, Sep 29, 2014 at 11:40 PM, Tao Xiao xiaotao.cs@gmail.com wrote:
I checked HBase UI. Well, this table is not
I somehow missed this. Do you still have the problem? You probably didn't
specify the correct spark-examples jar using --driver-class-path. See the
following for an example.
MASTER=local ./bin/spark-submit --driver-class-path
./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar
Hi All,
A noob question:
How do I get the SparkContext inside mapPartitions?
Example:
Let's say I have rddObjects that can be split into different partitions to be
assigned to multiple executors, to speed up exporting data from the database.
The variable sc is created in the main program using these
I am experiencing significant logging spam when running PySpark in IPython
Notebook.
Exhibit A: http://i.imgur.com/BDP0R2U.png
I have taken into consideration advice from:
http://apache-spark-user-list.1001560.n3.nabble.com/Disable-all-spark-logging-td1960.html
also
Hi,
I am new to Spark SQL.
I want to read only specific columns from the Parquet file, not all the columns
defined in it.
For instance, the schema of the parquet file would look like this:
{
  "type": "record",
  "name": "ElectricPowerUsage",
  "namespace": "jcascalog.parquet.example",
  "fields": [