Hi,
Currently I am trying to count over a document with multiple filters.
Let's say here is my document:
//user field1 field2 field3
user1 0 0 1
user2 0 1 0
user3 0 0 0
I want to count user.log against some filters like this:
Filter1: field1 == 0 and field2 == 0
Filter2: field1 == 0 and field3 == 1
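One way to do this in a single pass, as a rough sketch: read user.log once and keep one accumulator per filter (this assumes whitespace-separated columns in the order shown above; the variable names are illustrative):
val lines = sc.textFile("user.log")
val filter1 = sc.accumulator(0)  // field1 == 0 && field2 == 0
val filter2 = sc.accumulator(0)  // field1 == 0 && field3 == 1
lines.foreach { line =>
  val cols = line.split("\\s+")  // user, field1, field2, field3
  if (cols(1) == "0" && cols(2) == "0") filter1 += 1
  if (cols(1) == "0" && cols(3) == "1") filter2 += 1
}
println(s"Filter1: ${filter1.value}, Filter2: ${filter2.value}")
Two separate filter(...).count() calls would also work, but they would scan the file once per filter unless it is cached.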
Hi, I am trying to filter a large table with 3 columns. Spark SQL might be a
good choice but I want to do it without SQL. The goal is to filter the big table
with multiple clauses. I filtered the big table 3 times, but the first filtering takes
about 50 seconds while the second and third filter transformations took about
A workaround trick has been found and put in the ticket
https://issues.apache.org/jira/browse/SPARK-4854. Hope this is useful.
Hi Jai,
Refer to this doc and make sure your network is not blocking:
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html
Also make sure you are using the same version of Spark in both places (the
one on the cluster, and
Hi,
There is ongoing work on model export:
https://www.github.com/apache/spark/pull/3062
For now, since the LinearRegression model is serializable, you can save it as an
object file:
sc.parallelize(Seq(model)).saveAsObjectFile(path)
then
val model = sc.objectFile[LinearRegressionModel](path).first
model.predict(...)
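A minimal round trip along those lines (a sketch only; the path and the tiny training set are illustrative assumptions, not from the thread):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}

// Train, save the model as an object file, then read it back and predict.
val training = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 2.0))))
val trained = LinearRegressionWithSGD.train(training, 100)
sc.parallelize(Seq(trained)).saveAsObjectFile("/tmp/lr-model")
val restored = sc.objectFile[LinearRegressionModel]("/tmp/lr-model").first()
restored.predict(Vectors.dense(1.0, 2.0))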
Hi, I am trying to filter a large table with 3 columns. My goal is to filter
this big table using multiple clauses. I filtered the big table 3 times, but the first
filtering took about 50 seconds to complete whereas the second and third
filter transformations took about 5 seconds. I wonder if it is because of
Hi Das,
Thanks for your advice.
I'm not sure what the point of setting memoryFraction to 1 is. I've tried to
rerun the test with the following parameters in spark-defaults.conf, but it
failed again:
spark.rdd.compress true
spark.akka.frameSize 50
spark.storage.memoryFraction 0.8
Hi all,
I understand that Parquet allows for schema versioning automatically in the
format; however, I'm not sure whether Spark supports this.
I'm saving a SchemaRDD to a parquet file, registering it as a table, then
doing an insertInto with a SchemaRDD with an extra column.
The second
Hi guys,
It happens to me quite often that when the locality level of a task goes
further than LOCAL (NODE, RACK, etc.), I get some of the following
exceptions: "too many files open", "encountered unregistered class id",
"cannot cast X to Y".
I do not get any exceptions during shuffling (which means
Hi,
I've built spark successfully with maven but when I try to run spark-shell I
get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/SparkSubmit
Caused by:
Print the CLASSPATH and make sure the Spark assembly jar is there in the
classpath.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv danielru...@gmail.com wrote:
Hi,
I've built spark successfully with maven but when I try to run spark-shell
I get the following errors:
Spark
Hi Jeniba,
The second part of this meetup recording has a very good answer to your
question. TD explains the current behavior and the on-going work in Spark
Streaming to fix HA.
https://www.youtube.com/watch?v=jcJq3ZalXD8
-kr, Gerard.
On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson
I agree that this is not a trivial task: in this approach the Kafka acks
will be done by the Spark tasks, which means a pluggable way to ack your
input data source, i.e. changes in core.
From my limited experience with Kafka + Spark, what I've seen is that if Spark
tasks take longer than the
Our job is creating what appears to be an inordinate number of very small
tasks, which blow out our OS inode and file limits. Rather than continually
upping those limits, we are seeking to understand whether our real problem
is that too many tasks are running, perhaps because we are
I've got a simple pyspark program that generates two CSV files and then
carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
program works fine for smaller volumes of records, but when it goes beyond 3
million records for the fact dataset, I get the error below. I'm running
sc.textFile uses a Hadoop input format. Hadoop input formats by default
create one task per file, and they are not very suitable for many very
small files. Can you turn your 1000 files into one larger text file?
Otherwise maybe try:
val data = sc.textFile("/user/foo/myfiles/*").coalesce(100)
On
Try to repartition the data like:
val data = sc.textFile("/user/foo/myfiles/*").repartition(100)
Since the file size is small it shouldn't be a problem.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 6:21 PM, bethesda swearinge...@mac.com wrote:
Our job is creating what appears to be an
Hi, how many clients and how many products do you have? Cheers, Gen
jaykatukuri wrote
Hi all, I am running into an out of memory error while running ALS using
MLlib on a reasonably small data set consisting of around 6 million
ratings. The stack trace is below: java.lang.OutOfMemoryError: Java heap
Creating an RDD from a wildcard like this:
val data = sc.textFile("/user/foo/myfiles/*")
will create 1 partition for each file found. 1000 files = 1000 partitions.
A task is a job stage (defined as a sequence of transformations) applied to
a partition, so 1000 partitions = 1000 tasks per stage.
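A minimal sketch of cutting the partition (and therefore task) count right after such a read; the path and the target count of 100 are assumptions:
val raw = sc.textFile("/user/foo/myfiles/*")   // roughly one partition per file
val compacted = raw.coalesce(100)              // fewer partitions, no shuffle
println(s"${raw.partitions.length} -> ${compacted.partitions.length}")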
Hi,
As you have 1,000 files, the RDD created by textFile will have 1,000
partitions. This is normal. In fact, following the same principle as HDFS, it is
better to store data as a smaller number of larger files.
You can use data.coalesce(10) to solve this problem (it reduces the number of
Hi Guys,
I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2.
I am trying to convert an RDD of tuples
(u'string', int , {(int, int): int, (int, int): int})
to a schema rdd using the schema:
fields = [StructField('field1',StringType(),True),
Hi Demi,
Thanks for sharing.
What we usually do is let the driver read the configuration for the job and
pass the config object to the actual job as a serializable object. That way
we avoid the need for a centralized config-sharing point that has to be
accessed from the workers. As you have
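A minimal sketch of that pattern (the config fields, values, and path below are illustrative assumptions, not from the thread):
// The driver reads/parses the config once; case classes are Serializable by
// default, so the object ships inside the closures to the workers.
case class JobConfig(inputPath: String, minLength: Int)

val config = JobConfig(inputPath = "/data/in", minLength = 10)
val kept = sc.textFile(config.inputPath).filter(_.length >= config.minLength)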
Do you actually need spark streaming per se for your use case? If you're
just trying to read data out of kafka into hbase, would something like this
non-streaming rdd work for you:
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
Note
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
I think this is sort of a newbie question, but I've checked the API closely
and don't see an obvious answer:
Given an RDD, how would I create a new RDD of tuples where the first tuple
value is an incrementing Int, e.g. 1, 2, 3 ..., and the second value of the
tuple is the original RDD record? I'm
Hi
This time I need an expert.
On 1.1.1 and only in cluster (standalone or EC2)
when I use this code:
countersPublishers.foreachRDD(rdd => {
rdd.foreachPartition(partitionRecords => {
partitionRecords.foreach(record => {
//dbActorUpdater ! updateDBMessage(record)
You would do:
rdd.zipWithIndex
This gives you an RDD[(Original, Int)] where the second element is the index.
To get an (index, original) tuple, you will need to map that previous RDD into
the desired shape:
rdd.zipWithIndex.map(_.swap)
-kr, Gerard.
On Tue, Dec 16, 2014 at 4:12 PM,
You could try using zipWithIndex (links below to API docs). For example, in
python:
items = ['a', 'b', 'c']
items2 = sc.parallelize(items)
print(items2.first())
items3 = items2.map(lambda x: (x, x + '!'))
print(items3.first())
items4 = items3.zipWithIndex()
print(items4.first())
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1
hdfs@ams-rsrv01:/tmp/spark/spark-branch-1.1$ ./bin/spark-shell
Spark assembly has been built with Hive, including
Take a look at CombineFileInputFormat. Repartition or coalesce could
introduce shuffle I/O overhead.
On Dec 16, 2014 7:09 AM, bethesda swearinge...@mac.com wrote:
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
Hi All,
I saw some help online about forcing avro-mapred to hadoop2 using
classifiers.
Now my configuration is this:
val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
However I still get java.lang.IncompatibleClassChangeError. I think I am
not building Spark
It turns out that this happens when the checkpoint is set to a local directory
path. I have opened JIRA SPARK-4862 for Spark Streaming to output a better
error message.
Thanks,
Aniket
On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar aniket.bhatna...@gmail.com
wrote:
I am using spark 1.1.0 running a
This is how it looks on my machine.
Thanks
Best Regards
On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv danielru...@gmail.com wrote:
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
Dear spark community,
We were testing a spark failure scenario where the executor that is running
a Kafka Receiver dies.
We are running our streaming jobs on top of Mesos and we killed the Mesos
slave that was running the executor; a new executor was created on another
Mesos slave but
It seems to be slightly related to this:
https://issues.apache.org/jira/browse/SPARK-1340
But in this case, it's not the Task that is failing but the entire executor
where the Kafka Receiver resides.
2014-12-16 16:53 GMT+00:00 Luis Ángel Vicente Sánchez
langel.gro...@gmail.com:
Dear spark
Same here...
# jar tf lib/spark-assembly-1.1.2-SNAPSHOT-hadoop2.3.0.jar | grep
SparkSubmit.class
org/apache/spark/deploy/SparkSubmit.class
On Tue, Dec 16, 2014 at 6:50 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
This is how it looks on my machine.
Thanks
And this is what my classpath looks like:
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
I've added every jar in the lib dir to my classpath and still no luck:
Can you open the file bin/spark-class, put an echo $CLASSPATH below the
place where it exports it, and see what the contents are?
On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote:
I've added every jar in the lib dir to my classpath and still no luck:
Completely different from the one I set:
Classpath is
I'm using CDH5 that was installed via Cloudera Manager.
Does it matter?
Thanks,
Daniel
On 16 Dec 2014, at 19:18, Akhil Das ak...@sigmoidanalytics.com wrote:
Can you open the file bin/spark-class, put an echo $CLASSPATH below the
place where it exports it, and see what the
Thanks! zipWithIndex() works well. I had overlooked it because the name
'zip' is rather odd.
To ask a related question, if I use Zookeeper for table locking, will this
affect all attempts to access the Hive tables (including those from my Spark
applications) or only those made through the Thriftserver? In other words, does
Zookeeper provide concurrency for the Hive metastore in general
Okay,
I have an RDD that I want to run an aggregate over, but it insists on
spilling to disk even though I structured the processing to only require a
single pass.
In other words, I can do all of my processing one entry in the RDD at a time
without persisting anything.
I set
Hi Manas,
There is a small patch needed for HDP2.2. You can refer to this PR
https://github.com/apache/spark/pull/3409
There are some other issues compiling against hadoop2.6. But we will fully
support it very soon. You can ping me, if you want.
Thanks.
Zhan Zhang
On Dec 12, 2014, at 11:38
In case a little more information is helpful:
the RDD is constructed using sc.textFile(fileUri) where the fileUri is to a
.gz file (that's too big to fit on my disk).
I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.
This rdd is what I'm calling aggregate on and I expect to
Nvm. I'm going to post another question since this has to do with the way
spark handles sc.textFile with a file://.gz
Hi all – I'm running a long-running batch-processing job with Spark through
Yarn. I am doing the following:
Batch Process
val resultsArr =
sc.accumulableCollection(mutable.ArrayBuffer[ListenableFuture[Result]]())
InMemoryArray.forEach{
1) Using a thread pool, generate callable jobs that
Also, this may be related to this issue
https://issues.apache.org/jira/browse/SPARK-3885.
Further, to clarify, data is being written to Hadoop on the data nodes.
Would really appreciate any help. Thanks!
From: Ganelin, Ganelin, Ilya
Hi guys
I am hoping someone might have a clue on why this is happening. Otherwise I
will have to delve into the YARN module's source code to better understand the
issue.
On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar aniket.bhatna...@gmail.com
wrote:
I am running spark 1.1.0 on AWS EMR and I am
Is there a way to get Spark to NOT repartition/shuffle/expand a
sc.textFile(fileUri) when the URI is a gzipped file?
Expanding a gzipped file should be thought of as a transformation and not
an action (if the analogy is apt). There is no need to fully create and
fill out an intermediate RDD with
Hi all,
I am trying to run a simple Spark_Pi application through Yarn from Java code. I
have the Spark_Pi class and everything works fine if I run on Spark. However,
when I set master to yarn-client and set yarn mode to true, I keep getting
exceptions. I suspect this has something to do with
I think accumulators do exactly what you want.
(Scala syntax below, I'm just not familiar with the Java equivalent ...)
val f1counts = sc.accumulator(0)
val f2counts = sc.accumulator(0)
val f3counts = sc.accumulator(0)
textfile.foreach { s =>
  if (f1matches) f1counts += 1
...
}
Note that
Thanks Cheng. Tried it out and saw the InMemoryColumnarTableScan word in the
physical plan.
From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, December 12, 2014 11:37 PM
To: Judy Nash; user@spark.apache.org
Subject: Re: Spark SQL API Doc IsCached as SQL command
There isn’t a SQL
I am having the same issue, and it still does not update for me. I am trying
to execute the example by using bin/run-example
Hi,
I am trying to run the StreamingLinearRegression example on a single node
without Hadoop installed. I keep getting the following error and cannot find any
documentation about the issue:
14/12/16 13:26:50 ERROR JobScheduler: Error running job streaming job
141875801 ms.1
Wow, really weird. My intuition is the same as everyone else's: some
unprintable character. Here are a couple more debugging tricks I've used in
the past:
//set up an accumulator to catch the bad rows as a side-effect
val nBadRows = sc.accumulator(0)
val nGoodRows = sc.accumulator(0)
val badRows
Hi All,
My application loads 1000 files, each from 200 MB to a few GB, and combines
them with other data to do calculations.
Some pre-calculation must be done at the file level; after that, the
results need to be combined for further calculation.
In Hadoop, it is simple because I can
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s
On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas gerard.m...@gmail.com wrote:
Hi Jeniba,
The second part of this meetup recording has a very good answer to your
question. TD explains the current behavior and the on-going
It's a bug; could you file a JIRA for this? Thanks!
On Tue, Dec 16, 2014 at 5:49 AM, sahanbull sa...@skimlinks.com wrote:
Hi Guys,
I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2.
I am trying to convert an RDD of tuples
(u'string', int , {(int, int): int, (int, int): int})
I've been running a job in local mode using --master local[*] and I've
noticed that, for some reason, exceptions appear to get eaten, as in, I
don't see them. If I debug in my IDE, I'll see that an exception was thrown
if I step through the code, but if I just run the application, it appears
I had created https://issues.apache.org/jira/browse/SPARK-4866, it
will be fixed by https://github.com/apache/spark/pull/3714.
Thank you for reporting this.
Davies
On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu dav...@databricks.com wrote:
It's a bug, could you file a JIRA for this? thanks!
On
Are you certain that's happening, Jim? Why? What happens if you just do
sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat
for gzip and the RDD wrapper around it already have the streaming
behaviour you wish for, but I could be wrong. Also, are you in pyspark or
Scala Spark?
Hey Igor!
A few ways to work around this, depending on the level of exception-handling
granularity you're willing to accept:
1) use mapPartitions() to wrap the entire partition-handling code in a
try/catch -- this is fairly coarse-grained, however, and will fail the
entire partition.
2) modify
Your Spark is trying to load a Hadoop library, winutils.exe, which you
don't have on your Windows machine:
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
at
Hi Harry,
Thanks for your response.
I'm working in Scala. When I do a count call, it expands the RDD in the
count (since it's an action). You can see the call stack that results in
the failure of the job here:
ERROR DiskBlockObjectWriter - Uncaught exception while reverting
partial writes
Are you reading the file from your driver (main/master) program?
Is your file in a distributed filesystem like HDFS, available to all your nodes?
It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
Transformations are lazy,
Hi All,
I need help with regex in my sc.textFile()
I have lots of files with epoch-millisecond timestamps in their names,
e.g. abc_1418759383723.json.
Now I need to consume the last hour's files using the epoch timestamp
mentioned above.
I tried a couple of options; nothing seems to work for me.
If any
I've been trying to figure out how to use Spark to do a simple aggregation
without repartitioning and essentially creating fully instantiated
intermediate RDDs, and it seems virtually impossible.
I've now gone as far as writing my own single-partition RDD that wraps an
Iterator[String] and calling
Wow. I just realized what was happening and it's all my fault. I have a
library method that I wrote that presents the RDD, and I was actually
repartitioning it myself.
I feel pretty dumb. Sorry about that.
Hi
Recently I have had some problems with RDD behavior, specifically the
RDD.first and RDD.toArray methods when the RDD has only one element.
I get different results from the different methods on a one-element RDD where I
should get the same result. I will give more detail after the code.
My
Thanks for the clarifications. I misunderstood what the number on the UI meant.
On Mon, Dec 15, 2014 at 7:00 PM, Sean Owen so...@cloudera.com wrote:
I believe this corresponds to the 0.6 of the whole heap that is
allocated for caching partitions. See spark.storage.memoryFraction on
Hi!
When will Spark 1.3.0 be released?
I want to use the new LDA feature.
Thank you!
When it is ready.
On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote:
Hi!
When will Spark 1.3.0 be released?
I want to use the new LDA feature.
Thank you!
RDD.persist() can be useful here.
On 11 December 2014 at 14:34, ankits [via Apache Spark User List]
ml-node+s1001560n20613...@n3.nabble.com wrote:
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too
fast. How can I inspect the size of an RDD in memory and get more information
Actually there was still a fetch failure. However, after I upgraded Spark to
1.1.1, this error did not occur again.
Thanks,
Mars
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: December 16, 2014 17:52
To: Ma,Xi
Cc: u...@spark.incubator.apache.org
Subject: Re: Re: Fetch Failed caused job failed.
Hi,
I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I
have two tables (backed by Parquet files) and I need to do a join across
them using a common field (user_id). This works fine using standard SQL
but not using the language-integrated DSL: neither
t1.join(t2, on =
Releases are roughly every 3 months, so you should expect around March if the
pace stays steady.
2014-12-16 22:56 GMT-05:00 Marco Shaw marco.s...@gmail.com:
When it is ready.
On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote:
Hi!
When will Spark 1.3.0 be released?
I
Hi there
I also got exception when running PI example on YARN
Spark version: spark-1.1.1-bin-hadoop2.4
My environment: Hortonworks HDP 2.2
My command:
./bin/spark-submit --master yarn-cluster --class
org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10
Output logs:
14/12/17 14:06:32
Hi, Shuai,
How did you turn off file splitting in Hadoop? I guess you might have
implemented a customized FileInputFormat which overrides isSplitable() to
return false. If you do have such a FileInputFormat, you can simply pass it as a
constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
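A minimal sketch of that approach (an illustration, not the poster's code; the input path is an assumption), using the new-API text input format with splitting disabled:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// With splitting disabled, each input file becomes exactly one split/partition.
class NonSplittableTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}

val lines = sc.newAPIHadoopFile(
  "/data/input",
  classOf[NonSplittableTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, text) => text.toString }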
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero. But
sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700,
which I know is correct from the data. Is there something wrong with how
I'm using the DSL?
Thanks
On 17/12/14 11:13 am, Jerry Raj wrote:
Gautham,
How many gz files do you have? Maybe the reason is that a gz file is
compressed and can't be split for processing by MapReduce. A single gz
file can only be processed by a single mapper, so the CPU threads can't be
fully utilized.
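A minimal sketch of the usual workaround (the path and partition count are assumptions): accept the single-partition read that gzip forces, then repartition so later stages use all cores:
// The read itself is one task (gzip is not splittable); the shuffle from
// repartition then spreads downstream work across 16 partitions.
val spread = sc.textFile("/data/big.log.gz").repartition(16)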
-Original Message-
From:
Hi,
I have a Spark cluster using standalone mode. The Spark Master is
configured in High Availability mode.
Now I am going to upgrade Spark from 1.0 to 1.1, but don't want to
interrupt the currently running jobs.
(1) Is there any way to perform a rolling upgrade (while running a job)?
(2) If not,
Jerry,
On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj jerry@gmail.com wrote:
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero.
Looks like you need ===:
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
Tobias
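For reference, a minimal sketch of the fixed expression with the names from the question ('=== builds a per-row column-equality expression, whereas == compares the Symbol itself to the string and is never true):
val matching = t1.where('term === "dmin")
println(matching.count())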
I get the key point. The problem is in sc.sequenceFile: from the API description,
the RDD will create many references to the same object. So I revised the code
from sessions.getBytes to sessions.getBytes.clone,
and it seems to work.
Thanks.
Did you try something like:
// Get the last hour
val d = (System.currentTimeMillis() - 3600 * 1000)
val ex = "abc_" + d.toString().substring(0, 7) + "*.json"
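If the substring prefix is too coarse, a hedged alternative (a sketch; the directory and file-name pattern are assumptions) is to list the files and keep only those whose embedded epoch-millisecond timestamp falls within the last hour:
import org.apache.hadoop.fs.{FileSystem, Path}

// List candidate files, parse the 13-digit epoch from each name, keep the last hour.
val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 3600 * 1000L
val recent = fs.listStatus(new Path("/data/json"))
  .map(_.getPath.toString)
  .filter(n => "_(\\d{13})\\.json$".r.findFirstMatchIn(n).exists(_.group(1).toLong >= cutoff))
val rdd = sc.textFile(recent.mkString(","))   // guard against an empty listing in real code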
Thanks
Best Regards
On Wed, Dec 17, 2014 at 5:05 AM, durga durgak...@gmail.com wrote:
Hi All,
I need help with
I also got the same problem.
2014-12-09 22:58 GMT+08:00 Daniel Haviv danielru...@gmail.com:
Hi,
I've built Spark 1.3 with Hadoop 2.6, but when I start up the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
Could you post the stack trace?
Best Regards,
Shixiong Zhu
2014-12-16 23:21 GMT+08:00 richiesgr richie...@gmail.com:
Hi
This time I need an expert.
On 1.1.1 and only in cluster (standalone or EC2)
when I use this code:
countersPublishers.foreachRDD(rdd => {
Hi guys,
I get Kryo exceptions of the type "unregistered class id" and "cannot cast
to class" when the locality level of the tasks goes beyond LOCAL.
However I get no Kryo exceptions during shuffling operations.
If the locality level never goes beyond LOCAL everything works fine.
Is there a special