I am not personally aware of a repo for snapshot builds.
In my use case, I had to build Spark 1.2.1-SNAPSHOT;
see https://spark.apache.org/docs/latest/building-spark.html
2015-01-30 17:11 GMT+01:00 Debajyoti Roy debajyoti@healthagen.com:
Thanks Ayoub and Zhan,
I am new to Spark and wanted
Hi Arush
I have configured log4j by updating the file log4j.properties in
SPARK_HOME/conf folder.
If it were a log4j defect, we would get the error in debug mode in all apps.
Thanks
Ankur
Hi Ankur,
How are you enabling the debug log level? It should be a log4j
configuration. Even if there would
But isn't foldLeft() overkill for the originally stated use case of max diff of
adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative
accumulation as opposed to an embarrassingly parallel operation such as this
one?
This use case reminds me of FIR filtering in DSP. It
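For illustration, a minimal sketch of the adjacent-pairs point in plain Scala (made-up data): the combining step is just max, which is commutative and associative, so no sequential fold is needed.

    // Pair each element with its neighbor, then take the max difference.
    // max parallelizes trivially, unlike an order-dependent foldLeft.
    val xs = Seq(1.0, 4.0, 2.0, 9.0)
    val maxDiff = xs.sliding(2).map { case Seq(a, b) => math.abs(b - a) }.max
    // maxDiff == 7.0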
Yup, if you turn off YARN's CPU scheduling then you can run executors to
take advantage of the extra memory on the larger boxes. But then some of
the nodes will end up severely oversubscribed from a CPU perspective, so I
would definitely recommend against that.
On Fri, Jan 30, 2015 at 3:31 AM,
Hi Krishna/all,
I think I found it, and it wasn't related to Scala-2.11...
I had spark.eventLog.dir=/mnt/spark/work/history, which worked
in Spark 1.2, but now I am running Spark master, and it wants a
Hadoop URI, e.g. file:///mnt/spark/work/history (I believe due to
commit 45645191).
This looks
And if it's a single giant time series that is already sorted, then Mohit's
solution sounds good to me.
On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak michaelma...@yahoo.com
wrote:
But isn't foldLeft() overkill for the originally stated use case of max
diff of adjacent pairs? Isn't foldLeft()
Hello,
I am seeing negative values for accumulators. Here's my implementation in a
standalone app in Spark 1.1.1rc:
implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
  def addInPlace(t1: Int, t2: BigInt) = BigInt(t1) + t2
  def addInPlace(t1: BigInt, t2: BigInt) = t1 + t2 // assumed completion of the truncated line
  def zero(initialValue: BigInt) = BigInt(0)       // assumed; required by AccumulatorParam
}
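A hedged sketch of how such a param is used once complete (sc is a SparkContext; the numbers are illustrative, not the original poster's code):

    val acc = sc.accumulator(BigInt(0)) // picks up the implicit param above
    sc.parallelize(1 to 100).foreach(i => acc += BigInt(i))
    println(acc.value) // expected: 5050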
I've been trying to run HiveQL queries with UDFs in Spark SQL, but with no
success. The problem occurs only when using functions like
from_unixtime (represented by the Hive class UDFFromUnixTime).
I'm using Spark 1.2 with CDH 5.3.0. Running the queries in local mode works,
but in YARN mode
Hi,
I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I get a different result. Since everything else stays
constant, I was wondering if this is due to the kFold function I used.
Sanity check: could `threshold_var` be negative?
—
FG
On Fri, Jan 30, 2015 at 5:06 PM, Peter Thai thai.pe...@gmail.com wrote:
Hello,
I am seeing negative values for accumulators. Here's my implementation in a
standalone app in Spark 1.1.1rc:
implicit object
Have a look at the source code for MLUtils.kFold. Yes, there is a
random element. That's good; you want the folds to be randomly chosen.
Note there is a seed parameter, as in a lot of the APIs, that lets you
fix the RNG seed and so get the same result every time, if you need
to.
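For illustration, a minimal sketch of fixing the seed (rdd and the numbers are placeholders):

    import org.apache.spark.mllib.util.MLUtils

    val folds = MLUtils.kFold(rdd, numFolds = 10, seed = 42)
    // Each element is a (training, validation) pair of RDDs; with a fixed
    // seed the split is identical on every run.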
On Fri, Jan 30,
Yeah, I meant foldLeft by key, sorted by date.
It is non-commutative because I care about the order of processing the
values (chronological). I don't see how I can do it with a reduce
efficiently, but I would be curious to hear otherwise. I might be biased
since this is such a typical operation in
Amit - IntelliJ will not find it until you add the import, as Sean mentioned. It
includes implicits that IntelliJ will not know about otherwise.
2015-01-30 12:44 GMT-08:00 Amit Behera amit.bd...@gmail.com:
I am sorry, Sean.
I am developing code in IntelliJ IDEA, so with the above dependencies I am
So far, the canonical way to materialize an RDD just to make sure it's
cached is to call count(). That's fine but incurs the overhead of
actually counting the elements.
However, rdd.foreachPartition(p => None) for example also seems to
cause the RDD to be materialized, and is a no-op. Is that a
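For illustration, a sketch of the two approaches being compared (the path is a placeholder):

    val rdd = sc.textFile("hdfs:///some/path").cache()

    rdd.count()                   // materializes, but pays to count every element
    rdd.foreachPartition(_ => ()) // materializes each partition as a no-op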
Hello Everyone,
A bit confused on this one...I have set up the KafkaWordCount found here:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
Everything runs fine when I run it using this on instance A:
From your stacktrace it appears that the S3 writer tries to write the data
to a temp file on the local file system first. Taking a guess, that local
directory doesn't exist or you don't have permissions for it.
-Sven
On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.
[1]
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
Davies
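The same API exists in Scala; a hedged sketch (the path is a placeholder, and the record length of 29 bytes matches the header layout described in the quoted message below, i.e. 8 + 16 + 4 + 1):

    // Each element is an Array[Byte] of exactly recordLength bytes.
    val records = sc.binaryRecords("hdfs:///data/fixed-width.bin", recordLength = 29)
    val firstField = records.map { bytes =>
      java.nio.ByteBuffer.wrap(bytes).getLong // e.g. the leading 8-byte Number
    }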
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
Hi,
I want to process some
Ah, this is in particular an issue due to sort-based shuffle (it was not
the case for hash-based shuffle, which would immediately serialize each
record rather than holding many in memory at once). The documentation
should be updated.
On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza
Koert, thanks for the referral to your current pull request! I found it
very thoughtful and thought-provoking.
On Fri, Jan 30, 2015 at 9:19 AM, Koert Kuipers ko...@tresata.com wrote:
And if it's a single giant time series that is already sorted, then Mohit's
solution sounds good to me.
On
Hi Charles,
I forgot to mention, but I imported the following:
import au.com.bytecode.opencsv.CSVParser
import org.apache.spark._
On Sat, Jan 31, 2015 at 2:09 AM, Charles Feduke charles.fed...@gmail.com
wrote:
Define "not working". Not compiling? If so you need:
import
Thank you very much Charles, I got it :)
On Sat, Jan 31, 2015 at 2:20 AM, Charles Feduke charles.fed...@gmail.com
wrote:
You'll still need to:
import org.apache.spark.SparkContext._
Importing org.apache.spark._ does _not_ recurse into sub-objects or
sub-packages, it only brings in
Is your code hitting frequent garbage collection?
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote:
Hi,
I am running my GraphX application on Spark 1.2.0 (11
Yes, I think so, especially for a Pregel application… any suggestions?
Best,
Yifan LI
On 30 Jan 2015, at 22:25, Sonal Goyal sonalgoy...@gmail.com wrote:
Is your code hitting frequent garbage collection?
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co/
-_- Sorry for the spam...I thought I could run Spark apps on workers, but I
cloned it on my Spark master and now it works.
This is on spark 1.2
I am loading ~6k Parquet files, roughly 500 MB each, into a SchemaRDD, and
calling count() on it.
After loading about 2705 tasks (there is one per file), the job crashes with
this error:
Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than
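The truncated error above is the driver-side cap on collected task results, spark.driver.maxResultSize (1g by default in Spark 1.2). A hedged sketch of raising it, at the cost of more driver memory (the "4g" value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.maxResultSize", "4g")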
I am sorry, Sean.
I am developing code in IntelliJ IDEA, so with the above dependencies I am
not able to find *groupByKey* when I search with Ctrl+Space.
On Sat, Jan 31, 2015 at 2:04 AM, Sean Owen so...@cloudera.com wrote:
When you post a question anywhere, and say it's not working, you
You'll still need to:
import org.apache.spark.SparkContext._
Importing org.apache.spark._ does _not_ recurse into sub-objects or
sub-packages, it only brings in whatever is at the level of the package or
object imported.
SparkContext._ has some implicits, one of them for adding groupByKey to an
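For illustration, a minimal sketch of the import doing its work (data is made up):

    import org.apache.spark.SparkContext._ // brings in rddToPairRDDFunctions

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val grouped = pairs.groupByKey() // resolves via the implicit conversion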
Sorry if this is a double post...I'm not sure if I can send from email or
have to come to the user list to create a new topic.
A bit confused on this one...I have set up the KafkaWordCount found here:
Are you using the default Java object serialization, or have you tried Kryo
yet? If you haven't tried Kryo, please do and let me know how much it
impacts the serialization size. (I know it's more efficient; I'm curious to
know how much more efficient, and I'm being lazy - I don't have ~6K 500MB
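For reference, a hedged sketch of turning Kryo on (MyRecord is a hypothetical class standing in for whatever is stored in those files):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional, avoids writing full class names into every record:
      .registerKryoClasses(Array(classOf[MyRecord]))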
Spark 1.2
While building a SchemaRDD using StructType,
xxx = new StructField("credit_amount", DecimalType, true) gives error: type
mismatch; found: org.apache.spark.sql.catalyst.types.DecimalType.type
required: org.apache.spark.sql.catalyst.types.DataType
From
You are grabbing the singleton, not the class. You need to specify the
precision (i.e. DecimalType.Unlimited or DecimalType(precision, scale))
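For illustration, a sketch of both fixes (the precision and scale values are illustrative):

    val f1 = new StructField("credit_amount", DecimalType.Unlimited, true)
    val f2 = new StructField("credit_amount", DecimalType(38, 10), true)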
On Fri, Jan 30, 2015 at 2:23 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
While building schemaRDD using StructType
xxx = new
Theoretically your approach would require less overhead - i.e. a collect on
the driver is not required as the last step. But maybe the difference is
small and that particular path may or may not have been properly optimized
vs the count(). Do you have a biggish data set to compare the timings?
Hi Guys,
I would like to put in the KafkaWordCount Scala code the Kafka parameter: val
kafkaParams = Map("fetch.message.max.bytes" -> "400"). I've put this
variable like this:
val KafkaDStreams = (1 to numStreams) map { _ =>
We are running a Spark streaming job that retrieves files from a directory
(using textFileStream).
One concern we are having is the case where the job is down but files are
still being added to the directory.
Once the job starts up again, those files are not being picked up (since
they are not
Define "not working". Not compiling? If so you need:
import org.apache.spark.SparkContext._
On Fri Jan 30 2015 at 3:21:45 PM Amit Behera amit.bd...@gmail.com wrote:
hi all,
my sbt file is like this:
name := "Spark"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies +=
Hi Andrew,
Here's a note from the doc for sequenceFile:
* '''Note:''' Because Hadoop's RecordReader class re-uses the same
Writable object for each
* record, directly caching the returned RDD will create many references
to the same object.
* If you plan to directly cache Hadoop
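For illustration, a hedged sketch of the safe pattern: copy the data out of the reused Writables before caching (types and path are placeholders):

    import org.apache.hadoop.io.{BytesWritable, Text}

    val safe = sc.sequenceFile("hdfs:///some/path", classOf[Text], classOf[BytesWritable])
      .map { case (k, v) => (k.toString, v.copyBytes()) } // real copies, safe to cache
      .cache()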
Hi,
When running a big MapReduce operation with PySpark (in this particular case using
a lot of sets and operations on sets in the map tasks, so likely to be allocating
and freeing loads of pages), I eventually get the kernel error 'python: page
allocation failure: order:10, mode:0x2000d0' plus very
Are you using SGD for logistic regression? There's a random element
there too, by nature. I looked into the code and see that you can't
set a seed, but actually, the sampling is done with a fixed seed per
partition anyway. Hm.
In general you would not expect these algorithms to produce the same
hi all,
my sbt file is like this:
name := "Spark"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
*code:*
object SparkJob {
  def pLines(lines: Iterator[String]) = {
    // Truncated in the digest; an assumed completion using the opencsv
    // parser imported elsewhere in this thread:
    val parser = new CSVParser(',')
    lines.map(parser.parseLine)
  }
}
Filed https://issues.apache.org/jira/browse/SPARK-5500 for this.
-Sandy
On Fri, Jan 30, 2015 at 11:59 AM, Aaron Davidson ilike...@gmail.com wrote:
Ah, this is in particular an issue due to sort-based shuffle (it was not
the case for hash-based shuffle, which would immediately serialize each
You can also use your InputFormat/RecordReader in Spark, e.g. using
newAPIHadoopFile. See here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
.
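For illustration, a hedged sketch (MyFixedLengthInputFormat is a hypothetical custom InputFormat; the key/value types and path are placeholders):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",
      classOf[MyFixedLengthInputFormat], // your existing InputFormat
      classOf[LongWritable],
      classOf[BytesWritable])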
-Sven
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:
Hi,
I want to process
Right. Which makes me believe that the directory is perhaps configured
somewhere and I have missed configuring the same. The process that is
submitting jobs (which basically becomes the driver) is running in sudo mode, and the
executors are executed by YARN. The Hadoop username is configured as
'hadoop'.
Hello,
I have a problem when querying, with a Hive context on Spark
1.2.1-SNAPSHOT, a column in my table which is a nested data structure, like an
array of structs.
The problem happens only on the table stored as Parquet; querying
the SchemaRDD saved as a temporary table doesn't lead to any
Is it possible that your schema contains duplicate columns or columns with
spaces in the name? The Parquet library will often give confusing error
messages in this case.
On Fri, Jan 30, 2015 at 10:33 AM, Ayoub benali.ayoub.i...@gmail.com wrote:
Hello,
I have a problem when querying, with a
Thanks. I did specify a seed parameter.
Seems that the problem is not caused by kFold. I actually ran another
experiment without cross validation. I just built a model with the training
data and then tested the model on the test data. However, the accuracy
still varies from one run to another.
No, it is not the case; here is the gist to reproduce the issue:
https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936
On Jan 30, 2015 8:29 PM, Michael Armbrust mich...@databricks.com wrote:
Is it possible that your schema contains duplicate columns or column with
spaces in the name? The
Hi Amit,
What error does it throw?
Thanks
Arush
On Sat, Jan 31, 2015 at 1:50 AM, Amit Behera amit.bd...@gmail.com wrote:
hi all,
my sbt file is like this:
name := "Spark"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
If you are only concerned about big partition sizes, you can specify the number
of partitions as an additional parameter while loading files from HDFS.
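For illustration (the number is a placeholder):

    // More input partitions up front => smaller partitions.
    val rdd = sc.textFile("hdfs:///big/file", minPartitions = 400)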
On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote:
You can also use your InputFormat/RecordReader in Spark, e.g. using
I've found a strange issue when trying to sort a lot of data in HDFS using
Spark 1.2.0 (CDH 5.3.0). My data is in SequenceFiles and the key is a class
that derives from BytesWritable (the value is also a BytesWritable). I'm
using a custom KryoSerializer to serialize the underlying byte array
According to the Gist Ayoub provided, the schema is fine. I reproduced
this issue locally; it should be a bug, but I don't think it's related to
SPARK-5236. Will investigate this soon.
Ayoub - would you mind helping to file a JIRA for this issue? Thanks!
Cheng
On 1/30/15 11:28 AM, Michael
Hi Capitão,
Since you're using CDH, your question is probably more appropriate for
the cdh-u...@cloudera.org list.
The problem you're seeing is most probably an artifact of the way CDH
is currently packaged. You have to add Hive jars manually to your Spark
app's classpath if you want to use the
Hi Guys,
any idea how to solve this error?
[error]
/sata_disk/workspace/spark-1.1.1/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:76:
missing parameter type for expanded function ((x$6, x$7) => x$6.$plus(x$7))
We have a series of spark jobs which run in succession over various cached
datasets, do small groups and transforms, and then call
saveAsSequenceFile() on them.
Each call to saveAsSequenceFile() appears to have done its work; the
task says it completed in xxx.x seconds, but then it pauses
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving the count is probably a win, but not
big. Well, maybe good to know.
On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch java...@gmail.com wrote:
Theoretically your approach would require less
Hi,
I have a stream pipeline which invokes map, reduceByKey, filter, and
flatMap. How can I measure the time taken in each stage?
Thanks,
Josh
Here are the same issues:
[1]
http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser
[2]
That is a known issue uncovered last week. It fails on certain
environments, not on Jenkins which is our testing environment.
There is already a PR up to fix it. For now you can build using mvn
package -DskipTests
TD
On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman
andrew.mussel...@gmail.com
Yeah, currently there isn't such a repo. However, the Spark team is
working on this.
Cheng
On 1/30/15 8:19 AM, Ayoub wrote:
I am not personally aware of a repo for snapshot builds.
In my use case, I had to build Spark 1.2.1-SNAPSHOT
see
Off master, got this error; is that typical?
---
T E S T S
---
Running org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.495
Not quite sure, but it could be a GC pause if you are holding too many
objects in memory. You can check the tuning guide at
http://spark.apache.org/docs/1.2.0/tuning.html if you haven't
already been through it.
Thanks
Best Regards
On Sat, Jan 31, 2015 at 7:22 AM, Corey Nolet cjno...@gmail.com
I believe from the web UI (running on port 8080) you will get these
measurements.
Thanks
Best Regards
On Sat, Jan 31, 2015 at 9:29 AM, Josh J joshjd...@gmail.com wrote:
Hi,
I have a stream pipeline which invokes map, reduceByKey, filter, and
flatMap. How can I measure the time taken in each
This is how I do it:
val tmp = test.map(x => (x, 1L)).reduceByWindow({ case ((word1, count1),
(word2, count2)) => (word1 + " " + word2, count1 + count2) }, Seconds(10),
Seconds(10))
In your case you are actually having a type mismatch:
Thanks
Best Regards
On Sat, Jan 31,
Assuming the data can be partitioned, you have many time series for
which you want to detect potential gaps. Also assuming the resulting gap
info per time series is much smaller than the time series data itself,
this is a classical example to me of a sorted (streaming) foldLeft,
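For illustration, a minimal sketch of that pattern using local Scala collections (names, data, and the threshold are made up); the same shape applies per key after a shuffle and sort:

    case class Event(key: String, time: Long)

    val threshold = 60L
    val events = Seq(Event("a", 0), Event("a", 30), Event("a", 200), Event("b", 10))

    // Per key: sort chronologically, then foldLeft carrying the previous
    // timestamp and collecting (prev, next) pairs that are too far apart.
    val gaps = events.groupBy(_.key).map { case (k, es) =>
      val sorted = es.sortBy(_.time)
      val found = sorted.tail.foldLeft((sorted.head.time, List.empty[(Long, Long)])) {
        case ((prev, acc), e) =>
          val next = if (e.time - prev > threshold) (prev, e.time) :: acc else acc
          (e.time, next)
      }._2.reverse
      k -> found
    }
    // gaps("a") == List((30, 200)); gaps("b") == Nil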
Hi,
I am running my GraphX application on Spark 1.2.0 (an 11-node cluster), and have
requested 30 GB of memory per node and 100 cores for an input dataset of around
1 GB (a 5-million-vertex graph).
But the error below always happens…
Could anyone give me some pointers?
(BTW, the overall edge/vertex RDDs
Hi,
I want to process some files; they're kind of big, dozens of
gigabytes each one. I get them as an array of bytes and there's a
structure inside them.
I have a header which describes the structure. It could be like:
Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ..
You should not disable the GC overhead limit. How does increasing executor
total memory cause you to not have enough memory? Do you mean something
else?
On Jan 30, 2015 1:16 AM, ey-chih chow eyc...@hotmail.com wrote:
I use the default value, which I think is 512MB. If I change to 1024MB,
Spark
By default, HashingTF turns each document into a sparse vector in R^(2^20),
i.e. a million-dimensional space. The current Spark clusterer turns each
sparse vector into a dense vector with a million entries when it is added to a
cluster. Hence, the memory needed grows as the number of clusters times 8 MB
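For illustration, a hedged sketch of shrinking the hash space to bound that cost (the size is a judgment call; docs is a placeholder RDD[Seq[String]]):

    import org.apache.spark.mllib.feature.HashingTF

    // 2^18 dims * 8 bytes = 2 MB per dense center instead of 8 MB.
    val tf = new HashingTF(numFeatures = 1 << 18)
    val vectors = docs.map(tokens => tf.transform(tokens))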
Hi, Siddharth
You can rebuild Spark with Maven by specifying -Dhadoop.version=2.5.0.
Thanks,
Sun.
fightf...@163.com
From: Siddharth Ubale
Date: 2015-01-30 15:50
To: user@spark.apache.org
Subject: Hi: hadoop 2.5 for spark
Hi,
I am a beginner with Apache Spark.
Can anyone let me know if it
You can use the prebuilt version that is built against Hadoop 2.4.
From: Siddharth Ubale
Date: 2015-01-30 15:50
To: user@spark.apache.org
Subject: Hi: hadoop 2.5 for spark
Hi,
I am a beginner with Apache Spark.
Can anyone let me know if it is mandatory to build Spark with the Hadoop
version I am
There is no need for a 2.5 profile. The hadoop-2.4 profile is for
Hadoop 2.4 and beyond. You can set the particular version you want
with -Dhadoop.version=
You do not need to make any new profile to compile vs 2.5.0-cdh5.2.1.
Again, the hadoop-2.4 profile is what you need.
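For example, something along these lines should work (hedging: check the exact flags against the building-spark page linked earlier in this digest):

    mvn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1 -DskipTests clean package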
On Thu, Jan 29, 2015
I get a serialization problem trying to run this in
Python:
sc.parallelize(['1','2']).map(lambda id: client.getRow('table', id, None))
cloudpickle.py can't pickle the method_descriptor type.
I added a function to pickle a method descriptor, and now it exceeds the
recursion limit.
I print the method name before I
Hi Somya,
I meant: when you configure JAVA_OPTS and when you don't configure
JAVA_OPTS, is there any difference in the error message?
Are you facing the same issue when you built using maven?
Thanks
Arush
On Thu, Jan 29, 2015 at 10:22 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
Hi,
Whenever I start spark shell I get this warning.
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
What's the meaning of this, and how can it impact the execution of my
Spark jobs?
Please suggest how I can fix
This is ignorable, and a message from Hadoop, which basically means
what it says. It's almost infamous; search Google. You don't have to
do anything.
On Fri, Jan 30, 2015 at 1:04 PM, kundan kumar iitr.kun...@gmail.com wrote:
Hi,
Whenever I start spark shell I get this warning.
WARN
Sorry, but I think there’s a disconnect.
When you launch a job under YARN on any of the hadoop clusters, the number of
mappers/reducers is not set and is dependent on the amount of available
resources.
So under Ambari, CM, or MapR’s Admin, you should be able to specify the amount
of