Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Ayoub
I am not personally aware of a repo for snapshot builds. In my use case, I had to build Spark 1.2.1-SNAPSHOT; see https://spark.apache.org/docs/latest/building-spark.html 2015-01-30 17:11 GMT+01:00 Debajyoti Roy debajyoti@healthagen.com: Thanks Ayoub and Zhan, I am new to spark and wanted

Re: Error when running spark in debug mode

2015-01-30 Thread Ankur Srivastava
Hi Arush, I have configured log4j by updating the file log4j.properties in the SPARK_HOME/conf folder. If it were a log4j defect, we would get the error in debug mode in all apps. Thanks Ankur Hi Ankur, How are you enabling the debug level of logs? It should be a log4j configuration. Even if there would

Re: spark challenge: zip with next???

2015-01-30 Thread Michael Malak
But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative accumulation as opposed to an embarrassingly parallel operation such as this one? This use case reminds me of FIR filtering in DSP. It

Re: HW imbalance

2015-01-30 Thread Sandy Ryza
Yup, if you turn off YARN's CPU scheduling then you can run executors to take advantage of the extra memory on the larger boxes. But then some of the nodes will end up severely oversubscribed from a CPU perspective, so I would definitely recommend against that. On Fri, Jan 30, 2015 at 3:31 AM,

Re: spark-shell working in scala-2.11 (breaking change?)

2015-01-30 Thread Stephen Haberman
Hi Krishna/all, I think I found it, and it wasn't related to Scala-2.11... I had spark.eventLog.dir=/mnt/spark/work/history, which worked in Spark 1.2, but now am running Spark master, and it wants a Hadoop URI, e.g. file:///mnt/spark/work/history (I believe due to commit 45645191). This looks

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
and if it's a single giant timeseries that is already sorted then Mohit's solution sounds good to me. On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak michaelma...@yahoo.com wrote: But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft()

Negative Accumulators

2015-01-30 Thread Peter Thai
Hello, I am seeing negative values for accumulators. Here's my implementation in a standalone app in Spark 1.1.1rc: implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] { def addInPlace(t1: Int, t2: BigInt) = BigInt(t1) + t2 def addInPlace(t1: BigInt, t2: BigInt) =
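
For reference, a minimal self-contained accumulator along these lines (Spark 1.x Accumulator API; the object name matches the snippet, but the local-mode setup and sample job are purely illustrative) might look like this sketch:

```scala
import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

// Minimal sketch of a BigInt accumulator for the Spark 1.x Accumulator API.
object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
  def zero(initialValue: BigInt): BigInt = BigInt(0)
  def addInPlace(t1: BigInt, t2: BigInt): BigInt = t1 + t2
}

object AccumulatorSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]"))
  val acc = sc.accumulator(BigInt(0))(BigIntAccumulatorParam)
  sc.parallelize(1 to 1000).foreach(n => acc += BigInt(n))
  println(acc.value) // expected: 500500
  sc.stop()
}
```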

Spark SQL - Unable to use Hive UDF because of ClassNotFoundException

2015-01-30 Thread Capitão
I've been trying to run HiveQL queries with UDFs in Spark SQL, but with no success. The problem occurs only when using functions, like from_unixtime (represented by the Hive class UDFFromUnixTime). I'm using Spark 1.2 with CDH5.3.0. Running the queries in local mode works, but in Yarn mode

Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Hi, I am using the utility function kFold provided in Spark for doing k-fold cross validation using logistic regression. However, each time I run the experiment, I get different results. Since everything else stays constant, I was wondering if this is due to the kFold function I used.

Re: Negative Accumulators

2015-01-30 Thread francois . garillot
Sanity-check: would it be possible that `threshold_var` be negative ? — FG On Fri, Jan 30, 2015 at 5:06 PM, Peter Thai thai.pe...@gmail.com wrote: Hello, I am seeing negative values for accumulators. Here's my implementation in a standalone app in Spark 1.1.1rc: implicit object

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Have a look at the source code for MLUtils.kFold. Yes, there is a random element. That's good; you want the folds to be randomly chosen. Note there is a seed parameter, as in a lot of the APIs, that lets you fix the RNG seed and so get the same result every time, if you need to. On Fri, Jan 30,
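
A minimal sketch of what Sean describes, assuming an existing RDD[LabeledPoint] named data (MLUtils.kFold in MLlib 1.x takes a numFolds and a seed argument):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Fixing the seed makes the train/validation splits reproducible across runs.
def reproducibleFolds(data: RDD[LabeledPoint]): Array[(RDD[LabeledPoint], RDD[LabeledPoint])] =
  MLUtils.kFold(data, numFolds = 10, seed = 42)
```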

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
yeah i meant foldLeft by key, sorted by date. it is non-commutative because i care about the order of processing the values (chronological). i don't see how i can do it with a reduce efficiently, but i would be curious to hear otherwise. i might be biased since this is such a typical operation in

Re: groupByKey is not working

2015-01-30 Thread Stephen Boesch
Amit - IJ will not find it until you add the import as Sean mentioned. It includes implicits that intellij will not know about otherwise. 2015-01-30 12:44 GMT-08:00 Amit Behera amit.bd...@gmail.com: I am sorry Sean. I am developing code in intelliJ Idea. so with the above dependencies I am

Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
So far, the canonical way to materialize an RDD just to make sure it's cached is to call count(). That's fine but incurs the overhead of actually counting the elements. However, rdd.foreachPartition(p => None) for example also seems to cause the RDD to be materialized, and is a no-op. Is that a
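
A sketch of the two options being compared, assuming the goal is simply to force the RDD into the cache (the persist level and helper name are illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def materialize[T](rdd: RDD[T]): RDD[T] = {
  rdd.persist(StorageLevel.MEMORY_ONLY)
  // Option 1: rdd.count() -- materializes, but also aggregates a count back to the driver.
  // Option 2: a no-op foreachPartition -- materializes without computing anything extra.
  rdd.foreachPartition(_ => ())
  rdd
}
```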

Trouble deploying spark program because of soft link?

2015-01-30 Thread Su She
Hello Everyone, A bit confused on this one...I have set up the KafkaWordCount found here: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java Everything runs fine when I run it using this on instance A:

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Sven Krasser
From your stacktrace it appears that the S3 writer tries to write the data to a temp file on the local file system first. Taking a guess, that local directory doesn't exist or you don't have permissions for it. -Sven On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com

Re: Define size partitions

2015-01-30 Thread Davies Liu
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case. [1] http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords Davies On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I want to process some
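
The Scala equivalent of the linked Python API is sc.binaryRecords(path, recordLength). A sketch, assuming an existing SparkContext named sc and using the fixed-width layout described in the original question (8-byte number, 16-byte chars, 4-byte number, 1-byte char) purely as an example; the path is made up:

```scala
import java.nio.ByteBuffer

// Each record is exactly recordLength bytes, so partitions never split a record.
val recordLength = 8 + 16 + 4 + 1
val records = sc.binaryRecords("hdfs:///data/records.bin", recordLength)
val parsed = records.map { bytes =>
  val buf = ByteBuffer.wrap(bytes)
  val num1 = buf.getLong                         // bytes 0-7
  val text = new String(bytes, 8, 16, "UTF-8")   // bytes 8-23
  val num2 = buf.getInt(24)                      // bytes 24-27
  val flag = bytes(28).toChar                    // byte 28
  (num1, text, num2, flag)
}
```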

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Aaron Davidson
Ah, this is in particular an issue due to sort-based shuffle (it was not the case for hash-based shuffle, which would immediately serialize each record rather than holding many in memory at once). The documentation should be updated. On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza

Re: spark challenge: zip with next???

2015-01-30 Thread Derrick Burns
Koert, thanks for the referral to your current pull request! I found it very thoughtful and thought-provoking. On Fri, Jan 30, 2015 at 9:19 AM, Koert Kuipers ko...@tresata.com wrote: and if its a single giant timeseries that is already sorted then Mohit's solution sounds good to me. On

Re: groupByKey is not working

2015-01-30 Thread Amit Behera
Hi Charles, I forgot to mention. But I imported the following import au.com.bytecode.opencsv.CSVParser import org.apache.spark._ On Sat, Jan 31, 2015 at 2:09 AM, Charles Feduke charles.fed...@gmail.com wrote: Define not working. Not compiling? If so you need: import

Re: groupByKey is not working

2015-01-30 Thread Amit Behera
Thank you very much Charles, I got it :) On Sat, Jan 31, 2015 at 2:20 AM, Charles Feduke charles.fed...@gmail.com wrote: You'll still need to: import org.apache.spark.SparkContext._ Importing org.apache.spark._ does _not_ recurse into sub-objects or sub-packages, it only brings in

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Sonal Goyal
Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am running my graphx application on Spark 1.2.0(11

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Yifan LI
Yes, I think so, esp. for a pregel application… have any suggestion? Best, Yifan LI On 30 Jan 2015, at 22:25, Sonal Goyal sonalgoy...@gmail.com wrote: Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co/

Re: Trouble deploying spark program because of soft link?

2015-01-30 Thread suhshekar52
-_- Sorry for the spam...I thought I could run spark apps on workers, but I cloned it on my spark master and now it works.

Serialized task result size exceeded

2015-01-30 Thread ankits
This is on spark 1.2 I am loading ~6k parquet files, roughly 500 MB each into a schemaRDD, and calling count() on it. After loading about 2705 tasks (there is one per file), the job crashes with this error: Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than

Re: groupByKey is not working

2015-01-30 Thread Amit Behera
I am sorry Sean. I am developing code in intelliJ Idea. so with the above dependencies I am not able to find *groupByKey* when I am searching by ctrl+space On Sat, Jan 31, 2015 at 2:04 AM, Sean Owen so...@cloudera.com wrote: When you post a question anywhere, and say it's not working, you

Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
You'll still need to: import org.apache.spark.SparkContext._ Importing org.apache.spark._ does _not_ recurse into sub-objects or sub-packages, it only brings in whatever is at the level of the package or object imported. SparkContext._ has some implicits, one of them for adding groupByKey to an
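
A minimal sketch of the fix, assuming an existing SparkContext named sc (in Spark 1.1/1.2 the pair-RDD implicits live in the SparkContext companion object, so the extra import is required):

```scala
import org.apache.spark.SparkContext._   // rddToPairRDDFunctions and other implicits

// With the implicit conversion in scope, groupByKey resolves on RDDs of pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey()         // RDD[(String, Iterable[Int])]
```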

Trouble deploying spark program because of soft link?

2015-01-30 Thread suhshekar52
Sorry if this is a double post...I'm not sure if I can send from email or have to come to the user list to create a new topic. A bit confused on this one...I have set up the KafkaWordCount found here:

Re: Serialized task result size exceeded

2015-01-30 Thread Charles Feduke
Are you using the default Java object serialization, or have you tried Kryo yet? If you haven't tried Kryo please do and let me know how much it impacts the serialization size. (I know it's more efficient, I'm curious to know how much more efficient, and I'm being lazy - I don't have ~6K 500MB
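
A sketch of the two settings this thread touches on, with illustrative values (spark.driver.maxResultSize is the 1 GB limit the error message refers to; it was introduced in Spark 1.2):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("parquet-count")
  // Kryo generally shrinks serialized task results compared to Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Raise the cap on total serialized results collected by the driver (default 1g in 1.2).
  .set("spark.driver.maxResultSize", "2g")
```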

Why is DecimalType separate from DataType ?

2015-01-30 Thread Manoj Samel
Spark 1.2 While building schemaRDD using StructType xxx = new StructField("credit_amount", DecimalType, true) gives error type mismatch; found : org.apache.spark.sql.catalyst.types.DecimalType.type required: org.apache.spark.sql.catalyst.types.DataType From

Re: Why is DecimalType separate from DataType ?

2015-01-30 Thread Michael Armbrust
You are grabbing the singleton, not the class. You need to specify the precision (i.e. DecimalType.Unlimited or DecimalType(precision, scale)) On Fri, Jan 30, 2015 at 2:23 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 While building schemaRDD using StructType xxx = new
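
A sketch of the fix Michael describes, using the catalyst types that the error message in the original post refers to. The "credit_amount" column name comes from that post; the second column and the precision/scale values are made up to show the explicit form:

```scala
import org.apache.spark.sql.catalyst.types.{DecimalType, StructField, StructType}

// Pass a DecimalType instance, not the DecimalType companion object itself.
val schema = StructType(Seq(
  StructField("credit_amount", DecimalType.Unlimited, nullable = true),
  // or pin the precision and scale explicitly:
  StructField("credit_limit", DecimalType(38, 10), nullable = true)
))
```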

Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Stephen Boesch
Theoretically your approach would require less overhead - i.e. a collect on the driver is not required as the last step. But maybe the difference is small and that particular path may or may not have been properly optimized vs the count(). Do you have a biggish data set to compare the timings?

KafkaWordCount

2015-01-30 Thread Eduardo Costa Alfaia
Hi Guys, I would like to put in the kafkawordcount scala code the kafka parameter: val kafkaParams = Map("fetch.message.max.bytes" -> "400"). I’ve put this variable like this val KafkaDStreams = (1 to numStreams) map {_ =>
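
A sketch of how such a parameter could be threaded through, using the KafkaUtils.createStream overload that accepts a kafkaParams map. ssc, topics, numThreads and numStreams are assumed to exist as in the KafkaWordCount example, and all values are illustrative:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "localhost:2181",
  "group.id" -> "kafka-wordcount",
  "fetch.message.max.bytes" -> "4000000")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

// One receiver per stream, each honoring the custom Kafka consumer settings.
val kafkaDStreams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
}
val lines = ssc.union(kafkaDStreams)
```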

Spark streaming - tracking/deleting processed files

2015-01-30 Thread ganterm
We are running a Spark streaming job that retrieves files from a directory (using textFileStream). One concern we are having is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not being picked up (since they are not
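
One commonly suggested option (a sketch only; behavior around pre-existing files has changed between releases, so verify against your Spark version) is to drop down from textFileStream to fileStream and pass newFilesOnly = false, assuming an existing StreamingContext named ssc and an illustrative directory:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newFilesOnly = false asks Spark to also pick up files already present in the directory.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///incoming", (path: Path) => true, newFilesOnly = false)
  .map(_._2.toString)
```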

Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
Define "not working". Not compiling? If so you need: import org.apache.spark.SparkContext._ On Fri Jan 30 2015 at 3:21:45 PM Amit Behera amit.bd...@gmail.com wrote: hi all, my sbt file is like this: name := "Spark" version := "1.0" scalaVersion := "2.10.4" libraryDependencies +=

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Hi Andrew, Here's a note from the doc for sequenceFile: * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD will create many references to the same object. * If you plan to directly cache Hadoop
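
A sketch of the usual workaround implied by that note: copy each (reused) Writable into a fresh byte array before caching or sorting, assuming an existing SparkContext named sc and an illustrative path:

```scala
import org.apache.hadoop.io.BytesWritable

val copied = sc
  .sequenceFile("hdfs:///data/input.seq", classOf[BytesWritable], classOf[BytesWritable])
  .map { case (k, v) =>
    // getBytes returns the (possibly oversized) backing array, so copy only getLength bytes.
    (java.util.Arrays.copyOfRange(k.getBytes, 0, k.getLength),
     java.util.Arrays.copyOfRange(v.getBytes, 0, v.getLength))
  }
```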

pyspark and and page allocation failures due to memory fragmentation

2015-01-30 Thread Antony Mayi
Hi, When running big mapreduce operation with pyspark (in the particular case using lot of sets and operations on sets in the map tasks so likely to be allocating and freeing loads of pages) I eventually get kernel error 'python: page allocation failure: order:10, mode:0x2000d0' plus very

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Are you using SGD for logistic regression? There's a random element there too, by nature. I looked into the code and see that you can't set a seed, but actually, the sampling is done with a fixed seed per partition anyway. Hm. In general you would not expect these algorithms to produce the same

groupByKey is not working

2015-01-30 Thread Amit Behera
hi all, my sbt file is like this: name := "Spark" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3" *code:* object SparkJob { def pLines(lines:Iterator[String])={ val parser=new

Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Filed https://issues.apache.org/jira/browse/SPARK-5500 for this. -Sandy On Fri, Jan 30, 2015 at 11:59 AM, Aaron Davidson ilike...@gmail.com wrote: Ah, this is in particular an issue due to sort-based shuffle (it was not the case for hash-based shuffle, which would immediately serialize each

Re: Define size partitions

2015-01-30 Thread Sven Krasser
You can also use your InputFormat/RecordReader in Spark, e.g. using newAPIHadoopFile. See here: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext . -Sven On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Hi, I want to process
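
A sketch of that approach using Hadoop's built-in FixedLengthInputFormat as a concrete stand-in for whatever custom InputFormat/RecordReader you already have (the path and the 29-byte record length are illustrative), assuming an existing SparkContext named sc:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
FixedLengthInputFormat.setRecordLength(hadoopConf, 29)

val raw = sc.newAPIHadoopFile("hdfs:///data/fixed-width",
  classOf[FixedLengthInputFormat], classOf[LongWritable], classOf[BytesWritable], hadoopConf)
// Copy out of the reused BytesWritable before doing anything that caches or buffers records.
val payloads = raw.map { case (_, v) =>
  java.util.Arrays.copyOfRange(v.getBytes, 0, v.getLength) }
```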

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
Right. Which makes me believe that the directory is perhaps configured somewhere and I have missed configuring it. The process that is submitting jobs (it basically becomes the driver) is running in sudo mode and the executors are executed by YARN. The hadoop username is configured as 'hadoop'

[hive context] Unable to query array once saved as parquet

2015-01-30 Thread Ayoub
Hello, I have a problem when querying, with a hive context on spark 1.2.1-snapshot, a column in my table which is a nested data structure like an array of structs. The problem happens only on the table stored as parquet, while querying the SchemaRDD saved as a temporary table doesn't lead to any

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Michael Armbrust
Is it possible that your schema contains duplicate columns or column with spaces in the name? The parquet library will often give confusing error messages in this case. On Fri, Jan 30, 2015 at 10:33 AM, Ayoub benali.ayoub.i...@gmail.com wrote: Hello, I have a problem when querying, with a

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Thanks. I did specify a seed parameter. Seems that the problem is not caused by kFold. I actually ran another experiment without cross validation. I just built a model with the training data and then tested the model on the test data. However, the accuracy still varies from one run to another.

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Ayoub
No it is not the case, here is the gist to reproduce the issue https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936 On Jan 30, 2015 8:29 PM, Michael Armbrust mich...@databricks.com wrote: Is it possible that your schema contains duplicate columns or column with spaces in the name? The

Re: groupByKey is not working

2015-01-30 Thread Arush Kharbanda
Hi Amit, What error does it throw? Thanks Arush On Sat, Jan 31, 2015 at 1:50 AM, Amit Behera amit.bd...@gmail.com wrote: hi all, my sbt file is like this: name := "Spark" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

Re: Define size partitions

2015-01-30 Thread Rishi Yadav
if you are only concerned about big partition sizes you can specify the number of partitions as an additional parameter while loading files from HDFS. On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote: You can also use your InputFormat/RecordReader in Spark, e.g. using
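
For example (a sketch, assuming an existing SparkContext named sc; the path and partition count are illustrative):

```scala
// The second argument is a minimum number of partitions, not an exact count.
val lines = sc.textFile("hdfs:///data/big-input", minPartitions = 400)
```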

Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread andrew.rowson
I've found a strange issue when trying to sort a lot of data in HDFS using spark 1.2.0 (CDH5.3.0). My data is in sequencefiles and the key is a class that derives from BytesWritable (the value is also a BytesWritable). I'm using a custom KryoSerializer to serialize the underlying byte array

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Cheng Lian
According to the Gist Ayoub provided, the schema is fine. I reproduced this issue locally; it should be a bug, but I don't think it's related to SPARK-5236. Will investigate this soon. Ayoub - would you mind helping to file a JIRA for this issue? Thanks! Cheng On 1/30/15 11:28 AM, Michael

Re: Spark SQL - Unable to use Hive UDF because of ClassNotFoundException

2015-01-30 Thread Marcelo Vanzin
Hi Capitão, Since you're using CDH, your question is probably more appropriate for the cdh-u...@cloudera.org list. The problem you're seeing is most probably an artifact of the way CDH is currently packaged. You have to add Hive jars manually to your Spark app's classpath if you want to use the

Error Compiling

2015-01-30 Thread Eduardo Costa Alfaia
Hi Guys, any idea how to solve this error? [error] /sata_disk/workspace/spark-1.1.1/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:76: missing parameter type for expanded function ((x$6, x$7) => x$6.$plus(x$7))

Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of spark jobs which run in succession over various cached datasets, do small groups and transforms, and then call saveAsSequenceFile() on them. Each call to save as a sequence file appears to have done its work, the task says it completed in xxx.x seconds but then it pauses

Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
Yeah, from an unscientific test, it looks like the time to cache the blocks still dominates. Saving the count is probably a win, but not big. Well, maybe good to know. On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch java...@gmail.com wrote: Theoretically your approach would require less

measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each stage? Thanks, Josh

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-01-30 Thread Milad khajavi
Here are the same issues: [1] http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser [2]

Re: Build error

2015-01-30 Thread Tathagata Das
That is a known issue uncovered last week. It fails on certain environments, not on Jenkins which is our testing environment. There is already a PR up to fix it. For now you can build using mvn package -DskipTests TD On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman andrew.mussel...@gmail.com

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Cheng Lian
Yeah, currently there isn't such a repo. However, the Spark team is working on this. Cheng On 1/30/15 8:19 AM, Ayoub wrote: I am not personally aware of a repo for snapshot builds. In my use case, I had to build spark 1.2.1-snapshot see

Build error

2015-01-30 Thread Andrew Musselman
Off master, got this error; is that typical? --- T E S T S --- Running org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.495

Re: Long pauses after writing to sequence files

2015-01-30 Thread Akhil Das
Not quite sure, but it could be a GC pause, if you are holding too many objects in memory. You can check the tuning guide http://spark.apache.org/docs/1.2.0/tuning.html if you haven't already been through it. Thanks Best Regards On Sat, Jan 31, 2015 at 7:22 AM, Corey Nolet cjno...@gmail.com

Re: measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Akhil Das
I believe from the web UI (running on port 8080) you will get these measurements. Thanks Best Regards On Sat, Jan 31, 2015 at 9:29 AM, Josh J joshjd...@gmail.com wrote: Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each

Re: Error Compiling

2015-01-30 Thread Akhil Das
This is how i do it: val tmp = test.map(x => (x, 1L)).reduceByWindow({ case ((word1, count1), (word2, count2)) => (word1 + " " + word2, count1 + count2)}, Seconds(10), Seconds(10)) In your case you are actually having a type mismatch: [image: Inline image 1] Thanks Best Regards On Sat, Jan 31,

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
assuming the data can be partitioned then you have many timeseries for which you want to detect potential gaps. also assuming the resulting gaps info per timeseries is much smaller data than the timeseries data itself, then this is a classical example to me of a sorted (streaming) foldLeft,
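
For the single-sorted-timeseries case, a core-API sketch of "zip with next" (pairing each element with its successor via a shifted index and a join; the function name is made up):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed in Spark 1.x)
import org.apache.spark.rdd.RDD

def zipWithNext[T: ClassTag](rdd: RDD[T]): RDD[(T, T)] = {
  val indexed = rdd.zipWithIndex().map(_.swap)            // (i, x_i)
  val shifted = indexed.map { case (i, x) => (i - 1, x) } // (i, x_{i+1})
  indexed.join(shifted).values                            // (x_i, x_{i+1}); one shuffle
}

// e.g. the max difference of adjacent pairs from the original challenge:
// zipWithNext(timestamps).map { case (a, b) => b - a }.max()
```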

[Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Yifan LI
Hi, I am running my graphx application on Spark 1.2.0 (11 nodes cluster); I have requested 30GB memory per node and 100 cores for an input dataset of around 1GB (a 5 million vertices graph). But the error below always happens… Is there anyone who could give me some pointers? (BTW, the overall edge/vertex RDDs

Define size partitions

2015-01-30 Thread Guillermo Ortiz
Hi, I want to process some files; they're kind of big, dozens of gigabytes each one. I get them as an array of bytes and there's a structure inside them. I have a header which describes the structure. It could be like: Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ..

RE: unknown issue in submitting a spark job

2015-01-30 Thread Sean Owen
You should not disable the GC overhead limit. How does increasing executor total memory cause you to not have enough memory? Do you mean something else? On Jan 30, 2015 1:16 AM, ey-chih chow eyc...@hotmail.com wrote: I use the default value, which I think is 512MB. If I change to 1024MB, Spark

Re: KMeans with large clusters Java Heap Space

2015-01-30 Thread derrickburns
By default, HashingTF turns each document into a sparse vector in R^(2^20), i.e. a million dimensional space. The current Spark clusterer turns each sparse into a dense vector with a million entries when it is added to a cluster. Hence, the memory needed grows as the number of clusters times 8M
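
A sketch of one way to keep that product manageable: lower the hashing dimension when constructing HashingTF (2^16 here is illustrative; the default is 2^20), assuming docs is an RDD of token sequences:

```scala
import org.apache.spark.mllib.feature.HashingTF

// Dense centers cost roughly clusters * numFeatures * 8 bytes, so shrinking
// numFeatures shrinks the clusterer's memory footprint proportionally.
val tf = new HashingTF(numFeatures = 1 << 16)
val vectors = docs.map(tokens => tf.transform(tokens))
```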

Re: Hi: hadoop 2.5 for spark

2015-01-30 Thread fightf...@163.com
Hi, Siddharth You can rebuild Spark with maven by specifying -Dhadoop.version=2.5.0 Thanks, Sun. fightf...@163.com From: Siddharth Ubale Date: 2015-01-30 15:50 To: user@spark.apache.org Subject: Hi: hadoop 2.5 for spark Hi, I am a beginner with Apache Spark. Can anyone let me know if it

Re: Hi: hadoop 2.5 for spark

2015-01-30 Thread bit1...@163.com
You can use the prebuilt version that is built against Hadoop 2.4. From: Siddharth Ubale Date: 2015-01-30 15:50 To: user@spark.apache.org Subject: Hi: hadoop 2.5 for spark Hi, I am a beginner with Apache Spark. Can anyone let me know if it is mandatory to build spark with the Hadoop version I am

Re: spark with cdh 5.2.1

2015-01-30 Thread Sean Owen
There is no need for a 2.5 profile. The hadoop-2.4 profile is for Hadoop 2.4 and beyond. You can set the particular version you want with -Dhadoop.version= You do not need to make any new profile to compile vs 2.5.0-cdh5.2.1. Again, the hadoop-2.4 profile is what you need. On Thu, Jan 29, 2015

HBase Thrift API Error on map/reduce functions

2015-01-30 Thread mtheofilos
I get a serialization problem trying to run Python: sc.parallelize(['1','2']).map(lambda id: client.getRow('table', id, None)) cloudpickle.py can't pickle the method_descriptor type. I add a function to pickle a method descriptor and now it exceeds the recursion limit. I print the method name before i

Re: Building Spark behind a proxy

2015-01-30 Thread Arush Kharbanda
Hi Somya, I meant when you configure the JAVA_OPTS and when you don't configure the JAVA_OPTS is there any difference in the error message? Are you facing the same issue when you built using maven? Thanks Arush On Thu, Jan 29, 2015 at 10:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

WARN NativeCodeLoader warning in spark shell

2015-01-30 Thread kundan kumar
Hi, Whenever I start the spark shell I get this warning: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable What is the meaning of this, and how can it impact the execution of my Spark jobs? Please suggest how I can fix

Re: WARN NativeCodeLoader warning in spark shell

2015-01-30 Thread Sean Owen
This is ignorable, and a message from Hadoop, which basically means what it says. It's almost infamous; search Google. You don't have to do anything. On Fri, Jan 30, 2015 at 1:04 PM, kundan kumar iitr.kun...@gmail.com wrote: Hi, Whenever I start spark shell I get this warning. WARN

Re: HW imbalance

2015-01-30 Thread Michael Segel
Sorry, but I think there’s a disconnect. When you launch a job under YARN on any of the hadoop clusters, the number of mappers/reducers is not set and is dependent on the amount of available resources. So under Ambari, CM, or MapR’s Admin, you should be able to specify the amount of