spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Antony Mayi
Hi, I am using newAPIHadoopRDD to load an RDD from HBase (using pyspark running as yarn-client) - pretty much the standard case demonstrated in hbase_inputformat.py from the examples... the thing is, when trying the very same code on Spark 1.2 I am getting the error below which, based on

Trouble with large Yarn job

2015-01-07 Thread Anders Arpteg
Hey, I have a job that keeps failing if too much data is processed, and I can't see how to get it working. I've tried repartitioning with more partitions and increasing the amount of memory for the executors (now about 12G and 400 executors). Here is a snippet of the first part of the code, which

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Raghavendra Pandey
You can also use the join function of RDD. This is actually a kind of append function that adds up all the RDDs and creates one uber RDD. On Wed, Jan 7, 2015, 14:30 rkgurram rkgur...@gmail.com wrote: Thank you for the response, sure will try that out. Currently I changed my code such that the first

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-07 Thread Nick Pentreath
As I recall, Oryx (the old version, and I assume the new one too) provides something like this: http://cloudera.github.io/oryx/apidocs/com/cloudera/oryx/als/common/OryxRecommender.html#recommendToAnonymous-java.lang.String:A-float:A-int- though Sean will be more on top of that than me :) On Mon,

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-07 Thread Wouter Samaey
You’re right, Nick! This function does exactly that. Sean has already helped me greatly. Thanks for your reply. Wouter Samaey Zaakvoerder Storefront BVBA Tel: +32 472 72 83 07 Web: http://storefront.be LinkedIn: http://www.linkedin.com/in/woutersamaey On 07 Jan 2015, at 11:08, Nick

Re: Parquet schema changes

2015-01-07 Thread Adam Gilmore
Fantastic - glad to see that it's in the pipeline! On Wed, Jan 7, 2015 at 11:27 AM, Michael Armbrust mich...@databricks.com wrote: I want to support this but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851 On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread rkgurram
Thank you for the response, sure will try that out. Currently I changed my code such that the first map, files.map, became files.flatMap, which I guess will do something similar to what you are saying; it gives me a List[] of elements (in this case LabeledPoints, I could also do RDDs) which I then turned into a

example insert statement in Spark SQL

2015-01-07 Thread Niranda Perera
Hi, Are insert statements supported in Spark? If so, can you please give me an example? Rgds -- Niranda

Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Xuelin Cao
Hi, Curious and curious. I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be a columnar table, and only the columns included in my SQL should be scanned. However, in my test, I always see that the whole table is scanned even though I only select one column in

spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-07 Thread Aniket Bhatnagar
It seems that spark-network-yarn compiled for Scala 2.11 depends on spark-network-shuffle compiled for Scala 2.10. This causes cross-version dependency conflicts in sbt. Seems like a publishing error? http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR

Re: disable log4j for spark-shell

2015-01-07 Thread Akhil
Edit your conf/log4j.properties file and change the following line: log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console. Another approach would be to fire up spark-shell and type in the following: import org.apache.log4j.Logger import
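For reference, the programmatic route this message is cut off before presumably looks something like the following minimal sketch (the logger names chosen here are assumptions):

    import org.apache.log4j.{Level, Logger}

    // Silence the chatty Spark and Akka loggers from inside the shell
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)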

Re: Reading Data Using TextFileStream

2015-01-07 Thread Akhil Das
How about the following code? I'm not quite sure what you were doing inside the flatMap and foreach. import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.api.java.function.Function; import

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Sean Owen
I think you mean union(). Yes, you could also simply make an RDD for each file, and use SparkContext.union() to put them together. On Wed, Jan 7, 2015 at 9:51 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: You can also use join function of rdd. This is actually kind of append
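A minimal sketch of that suggestion, assuming the inputs are plain text files (the paths are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("union-example").setMaster("local[2]"))

    // Build one RDD per input file, then a single union instead of an "RDD of RDDs"
    val files = Seq("/data/part-0001.txt", "/data/part-0002.txt", "/data/part-0003.txt")
    val rdds  = files.map(path => sc.textFile(path))
    val uber  = sc.union(rdds)

    println(uber.count())
    sc.stop()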

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Sean Owen
Problems like this are always due to having code compiled for Hadoop 1.x run against Hadoop 2.x, or vice versa. Here, you compiled for 1.x but at runtime Hadoop 2.x is used. A common cause is actually bundling Spark / Hadoop classes with your app, when the app should just use the Spark / Hadoop
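A minimal build.sbt sketch of "use the cluster's copy at runtime", i.e. marking Spark (and the Hadoop client) as provided so they are not bundled into the application jar (the versions shown are assumptions, not a recommendation):

    // build.sbt -- keep Spark/Hadoop classes out of the assembly jar
    name := "my-spark-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "2.5.0" % "provided"
    )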

Re: [MLLib] storageLevel in ALS

2015-01-07 Thread Fernando O.
1.2. In run() you have usersOut.setName("usersOut").persist(StorageLevel.MEMORY_AND_DISK) productsOut.setName("productsOut").persist(StorageLevel.MEMORY_AND_DISK) On Wed, Jan 7, 2015, 02:11 Xiangrui Meng men...@gmail.com wrote: Which Spark version are you using? We made this configurable

[GraphX] Integration with TinkerPop3/Gremlin

2015-01-07 Thread Nicolas Colson
Hi Spark/GraphX community, I'm wondering if you have TinkerPop3/Gremlin on your radar? (github https://github.com/tinkerpop/tinkerpop3, doc http://www.tinkerpop.com/docs/3.0.0-SNAPSHOT) They've done amazing work refactoring their stack recently and Gremlin is a very nice DSL to work with

max receiving rate in spark streaming

2015-01-07 Thread Hafiz Mujadid
Hi experts! Is there any way to decide what can be an effective receiving rate for Kafka Spark Streaming? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/max-receiving-rate-in-spark-streaming-tp21013.html Sent from the Apache Spark User List mailing

TestSuiteBase based unit test using a sliding window join timesout

2015-01-07 Thread Enno Shioji
Hi, I extended org.apache.spark.streaming.TestSuiteBase for some testing, and I was able to run this test fine: test("Sliding window join with 3 second window duration") { val input1 = Seq( Seq("req1"), Seq("req2", "req3"), Seq(), Seq("req4", "req5", "req6"), Seq("req7"),

Re: max receiving rate in spark streaming

2015-01-07 Thread Akhil Das
If you are using the Lowlevel consumer https://github.com/dibbhatt/kafka-spark-consumer then you have an option to tweak the rate by setting *_fetchSizeBytes https://github.com/dibbhatt/kafka-spark-consumer/blob/master/src/main/java/consumer/kafka/KafkaConfig.java#L37 *value. Default is 64kb, you

Re: Reading Data Using TextFileStream

2015-01-07 Thread Akhil Das
You need to put some files in the location (/user/huser/user/huser/flume) once the job starts running; only then will it print. Also note I missed the / in the above code. Thanks Best Regards On Wed, Jan 7, 2015 at 4:42 PM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote: Hi Akhil,

About logistic regression sample codes in pyspark

2015-01-07 Thread cedric.artigue
Hi all, Recently I played a little bit with both the naive and MLlib Python sample codes for logistic regression. In short, I wanted to compare naive and non-naive logistic regression results using the same input weights and data. So I modified both sample codes slightly to use the same initial weights

KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Mukesh Jha
Hi Guys, I have a Kafka topic with 90 partitions and I am running Spark Streaming (1.2.0) to read from Kafka via KafkaUtils, creating 10 Kafka receivers. My streaming is running fine and there is no delay in processing, just that some partitions' data is never getting picked up. From the Kafka

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Gerard Maas
Hi, Could you add the code where you create the Kafka consumer? -kr, Gerard. On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote: Hi Mukesh, If my understanding is correct, each Stream only has a single Receiver. So, if you have each receiver consuming 9 partitions, you

FW: No APPLICATION_COMPLETE file created in history server log location upon pyspark job success

2015-01-07 Thread michael.england
Hi, I am currently running pyspark jobs against Spark 1.1.0 on YARN. When I run example Java jobs such as spark-pi, the following files get created: bash-4.1$ tree spark-pi-1420624364958 spark-pi-1420624364958 ├── APPLICATION_COMPLETE ├── EVENT_LOG_1 └── SPARK_VERSION_1.1.0 0 directories, 3

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread francois . garillot
Hi Mukesh, If my understanding is correct, each Stream only has a single Receiver. So, if you have each receiver consuming 9 partitions, you need 10 input DStreams to create 10 concurrent receivers:
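A minimal sketch of that idea in Scala, creating one DStream (and hence one receiver) per slice of the topic and unioning them; the ZooKeeper quorum, group, topic name and thread counts are placeholders (9 threads x 10 receivers just to match the thread's 90 partitions):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-receivers"), Seconds(10))

    // Each createStream call starts one receiver; each receiver runs 9 consumer threads
    val streams = (1 to 10).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 9),
        StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
    }
    val unified = ssc.union(streams)

    unified.count().print()
    ssc.start()
    ssc.awaitTermination()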

Re: disable log4j for spark-shell

2015-01-07 Thread Asim Jalis
Another option is to make a copy of log4j.properties in the current directory where you start spark-shell from, and modify log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console. Then start the shell. On Wed, Jan 7, 2015 at 3:39 AM, Akhil ak...@sigmoidanalytics.com wrote: Edit

akka.actor.ActorNotFound In Spark Streaming on Mesos (using ssc.actorStream)

2015-01-07 Thread Christophe Billiard
Hi all, I am trying to run this example on Mesos: https://github.com/jaceklaskowski/spark-activator#master I have mesos 0.21.0 (instead of 0.18.1, could that be a problem?) I downloaded the Spark pre-built package spark-1.2.0-bin-hadoop2.4.tgz

Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together based on some predefined business cases. After updating to 1.2.0, some of the concurrency expectations about how the stages within jobs are executed

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-07 Thread Cody Koeninger
But Xuelin already posted in the original message that the code was using SET spark.sql.parquet.filterPushdown=true On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv danielru...@gmail.com wrote: Quoting Michael: Predicate push down into the input format is turned off by default because there is
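For reference, a minimal sketch of turning the flag on from Spark SQL (the Parquet path, table name and predicate are placeholders taken from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("pushdown-test"))
    val sqlContext = new SQLContext(sc)

    // Predicate pushdown into the Parquet input format is off by default in Spark 1.2
    sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

    sqlContext.parquetFile("/data/events.parquet").registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events WHERE groupID = 10113000").collect()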

Spark Trainings/ Professional certifications

2015-01-07 Thread Saurabh Agrawal
Hi, Can you please suggest some of the best available trainings/ coaching and professional certifications in Apache Spark? We are trying to run predictive analysis on our Sales data and come out with recommendations (leads). We have tried to run CF but we end up getting absolutely bogus

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2015-01-07 Thread asimjalis
Hi Patrick: Do you know what the status of this issue is? Is there a JIRA that is tracking this issue? Thanks. Asim Patrick Wendell writes: Within a partition things will spill - so the current documentation is correct. This spilling can only occur *across keys* at the moment. Spilling cannot

Re: Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread Tobias Pfeiffer
Hi, On Thu, Jan 8, 2015 at 2:19 PM, rekt...@voodoowarez.com wrote: dstream processing bulk HDFS data- is something I don't feel is super well socialized yet, fingers crossed that base gets built up a little more. Just out of interest (and hoping not to hijack my own thread), why are you

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-07 Thread Xuelin Cao
Yes, the problem is, I've turned the flag on. One possible reason for this is, the parquet file supports predicate pushdown by setting statistical min/max value of each column on parquet blocks. If in my test, the groupID=10113000 is scattered in all parquet blocks, then the predicate pushdown

Can spark supports task level resource management?

2015-01-07 Thread Xuelin Cao
Hi, Currently, we are building up a middle-scale Spark cluster (100 nodes) in our company. One thing bothering us is how Spark manages resources (CPU, memory). I know there are 3 resource management modes: standalone, Mesos, Yarn. In the standalone mode, the cluster

Re: Can spark supports task level resource management?

2015-01-07 Thread Tim Chen
Hi Xuelin, I can only speak about Mesos mode. There are two modes of management in Spark's Mesos scheduler: fine-grained mode and coarse-grained mode. In fine-grained mode, each Spark task launches one or more Spark executors that only live through the lifetime of the task. So it's
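A minimal sketch of switching between the two modes Tim describes; the Mesos master URL is a placeholder, and spark.mesos.coarse is the property that selects coarse-grained mode:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mesos-modes")
      .setMaster("mesos://zk://zk1:2181/mesos")
      // false (the default) = fine-grained: executors live only as long as their tasks
      // true = coarse-grained: executors are launched up front and held for the app's lifetime
      .set("spark.mesos.coarse", "true")

    val sc = new SparkContext(conf)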

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Shixiong Zhu
I have not used CDH 5.3.0. But it looks like spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar contains some hadoop1 jars (coming from a wrong HBase version). I don't know the recommended way to build the spark-examples jar because the official Spark docs do not mention how to build it.

Eclipse flags error on KafkaUtils.createStream()

2015-01-07 Thread Kah-Chan Low
Hi, I am using Eclipse writing Java code. I am trying to create a Kafka receiver by: JavaPairReceiverInputDStream<String, kafka.message.Message> a = KafkaUtils.createStream(jssc, String.class, Message.class, StringDecoder.class, DefaultDecoder.class, kafkaParams,

RE: MatrixFactorizationModel serialization

2015-01-07 Thread Ganelin, Ilya
Try loading the features as: val userFeatures = sc.objectFile(path1) val productFeatures = sc.objectFile(path2) And then call the constructor of MatrixFactorizationModel with those. Sent with Good (www.good.com) -Original Message- From: wanbo [gewa...@163.com]
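A rough sketch of working with the reloaded feature RDDs; since the MatrixFactorizationModel constructor is not public in every 1.x build, this only reloads the RDDs saved with saveAsObjectFile and scores one (user, product) pair directly (paths and ids are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("load-als-features"))

    // Reload the feature RDDs that were written with saveAsObjectFile
    val userFeatures: RDD[(Int, Array[Double])] =
      sc.objectFile[(Int, Array[Double])]("hdfs:///spark/results/userfeatures")
    val productFeatures: RDD[(Int, Array[Double])] =
      sc.objectFile[(Int, Array[Double])]("hdfs:///spark/results/productfeatures")

    // A predicted rating is the dot product of the two factor vectors,
    // which is what MatrixFactorizationModel.predict computes internally
    val uVec = userFeatures.lookup(42).head
    val pVec = productFeatures.lookup(7).head
    val score = uVec.zip(pVec).map { case (a, b) => a * b }.sum
    println(s"predicted rating for (42, 7): $score")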

Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread Tobias Pfeiffer
Hi, I have a setup where data from an external stream is piped into Kafka and also written to HDFS periodically for long-term storage. Now I am trying to build an application that will first process the HDFS files and then switch to Kafka, continuing with the first item that was not yet in HDFS.

Re: Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread rektide
I've started 1 or 2 emails to ask more broadly- what are good practices for doing DStream computations in a non-realtime fashion? I'd love to have a good intro article to pass around to people, and some resources for those others chasing this problem. Back when I was working with Storm, managing

Re: Can spark supports task level resource management?

2015-01-07 Thread Xuelin Cao
Hi, Thanks for the information. One more thing I want to clarify: when does Mesos or Yarn allocate and release the resources? I.e., what is the resource lifetime? For example, in the standalone mode, the resource is allocated when the application is launched, resource released

Re: Can spark supports task level resource management?

2015-01-07 Thread Tim Chen
In coarse grain mode, the spark executors are launched and kept running while the scheduler is running. So if you have a spark shell launched and remained open, the executors are running and won't finish until the shell is exited. In fine grain mode, the overhead time mostly comes from

[Spark Core] function SparkContext.cancelJobGroup(groupId) doesn't work

2015-01-07 Thread Tao Li
Hi all: In my application, I start a SparkContext sc and execute some tasks on sc. (Each task is a thread, which executes some transformations and actions on RDDs.) For each task, I use sc.setJobGroup(JOB_GROUPID, JOB_DESCRIPTION) to set the jobGroup for each task. But when I call
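For context, a minimal sketch of the setJobGroup / cancelJobGroup pattern being described; the group id, description and workload are placeholders, and setJobGroup applies to the thread that submits the jobs:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("job-group-demo").setMaster("local[2]"))

    // Submit a long-running job from a separate thread under a named job group
    val worker = new Thread {
      override def run(): Unit = {
        sc.setJobGroup("group-1", "long running aggregation", interruptOnCancel = true)
        sc.parallelize(1 to 1000000).map { i => Thread.sleep(1); i }.count()
      }
    }
    worker.start()

    // Later, from another thread, cancel everything submitted under that group
    Thread.sleep(2000)
    sc.cancelJobGroup("group-1")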

RE: Spark with Hive cluster dependencies

2015-01-07 Thread Somnath Pandeya
You can also follow the link below. It works on a standalone Spark cluster. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started thanks Somnath From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, January 08, 2015 2:21 AM To: jamborta Cc: user

Re: MatrixFactorizationModel serialization

2015-01-07 Thread wanbo
I save and reload model like this: val bestModel = ALS.train(training, rank, numIter, lambda) bestModel.get.userFeatures.saveAsObjectFile(hdfs://***:9000/spark/results/userfeatures) bestModel.get.productFeatures.saveAsObjectFile(hdfs://***:9000/spark/results/productfeatures) val bestModel =

Re: Create DStream consisting of HDFS and (then) Kafka data

2015-01-07 Thread rektide
On Thu, Jan 08, 2015 at 02:33:30PM +0900, Tobias Pfeiffer wrote: Hi, On Thu, Jan 8, 2015 at 2:19 PM, rekt...@voodoowarez.com wrote: dstream processing bulk HDFS data- is something I don't feel is super well socialized yet, fingers crossed that base gets built up a little more.

Re: Can spark supports task level resource management?

2015-01-07 Thread Sandy Ryza
Hi Xuelin, Spark 1.2 includes a dynamic allocation feature that allows Spark on YARN to modulate its YARN resource consumption as the demands of the application grow and shrink. This is somewhat coarser than what you call task-level resource management. Elasticity comes through allocating and

Re: Can spark supports task level resource management?

2015-01-07 Thread Xuelin Cao
Got it, thanks. On Thu, Jan 8, 2015 at 3:30 PM, Tim Chen t...@mesosphere.io wrote: In coarse grain mode, the spark executors are launched and kept running while the scheduler is running. So if you have a spark shell launched and remained open, the executors are running and won't finish until

[MLlib] Scoring GBTs with a variable number of trees

2015-01-07 Thread Christopher Thom
Hi All, I wonder if anyone has any experience with building Gradient Boosted Tree models in MLlib, and can help me out. I'm trying to create a plot of the test error rate of a Gradient Boosted Tree model as a function of number of trees, to determine the optimal number of trees in the model.

spark 1.1 got error when working with cdh5.3.0 standalone mode

2015-01-07 Thread freedafeng
Hi, I installed the CDH 5.3.0 core + HBase on a new EC2 cluster. Then I manually installed Spark 1.1 on it, but when I started the slaves, I got the following error: ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077 Error: Could not find or load main class

Re: spark 1.1 got error when working with cdh5.3.0 standalone mode

2015-01-07 Thread Marcelo Vanzin
This could be caused by many things, including wrong configuration. Hard to tell with just the info you provided. Is there any reason why you want to use your own Spark instead of the one shipped with CDH? CDH 5.3 has Spark 1.2, so unless you really need to run Spark 1.1, you should be better off

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
Sorry- replace ### with an actual number. What does a skipped stage mean? I'm running a series of jobs and it seems like after a certain point, the number of skipped stages is larger than the number of actual completed stages. On Wed, Jan 7, 2015 at 3:28 PM, Ted Yu yuzhih...@gmail.com wrote:

Spark with Cassandra - Shuffle opening to many files

2015-01-07 Thread Ankur Srivastava
Hello, We are currently running our data pipeline on Spark, which uses Cassandra as the data source. We are currently facing an issue with the step where we create an RDD on data in a Cassandra table and then try to run flatMapToPair to transform the data, but we are running into 'Too many open files'. I

Re: Spark with Hive cluster dependencies

2015-01-07 Thread Michael Armbrust
Have you looked at Spark SQL http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables? It supports HiveQL, can read from the hive metastore, and does not require hadoop. On Wed, Jan 7, 2015 at 8:27 AM, jamborta jambo...@gmail.com wrote: Hi all, We have been building a system
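A minimal sketch of that route, assuming a Spark build with Hive support on the classpath (the table and query are the usual documentation placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hiveql-on-spark"))
    // Uses hive-site.xml if present on the classpath, otherwise a local embedded metastore
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("SELECT COUNT(*) FROM src").collect().foreach(println)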

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Shivaram Venkataraman
+Josh, who added the Job UI page. I've seen this as well and was a bit confused about what it meant. Josh, is there a specific scenario that creates these skipped stages in the Job UI ? Thanks Shivaram On Wed, Jan 7, 2015 at 12:32 PM, Corey Nolet cjno...@gmail.com wrote: Sorry- replace ###

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2015-01-07 Thread Andrew Or
Did you end up getting it working? By the way this might be a nicer view of the docs: https://github.com/apache/spark/blob/60e2d9e2902b132b14191c9791c71e8f0d42ce9d/docs/job-scheduling.md We will update the latest Spark docs to include this shortly. -Andrew 2015-01-04 4:44 GMT-08:00 Tsuyoshi

What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Ted Yu
Looks like the number of skipped stages couldn't be formatted. Cheers On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: We just upgraded to Spark 1.2.0 and we're seeing this in the UI.

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Mukesh Jha
I understand that I've to create 10 parallel streams. My code is running fine when the no. of partitions is ~20, but when I increase the no. of partitions I keep running into this issue. Below is my code to create the Kafka streams, along with the configs used. Map<String, String> kafkaConf = new

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread francois . garillot
- You are launching up to 10 threads/topic per Receiver. Are you sure your receivers can support 10 threads each ? (i.e. in the default configuration, do they have 10 cores). If they have 2 cores, that would explain why this works with 20 partitions or less. - If you have 90 partitions, why

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Sean Owen
Yes, the distribution is certainly fine and built for Hadoop 2. It sounds like you are inadvertently including Spark code compiled for Hadoop 1 when you run your app. The general idea is to use the cluster's copy at runtime. Those with more pyspark experience might be able to give more useful

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Laeeq Ahmed
Hi, I applied it as follows: eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1), StorageLevel.MEMORY_AND_DISK_SER).map(_._2) val counts = eegStreams(a).map(x => math.round(x.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4)) val sortedCounts =

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread Michael Armbrust
The cache command caches the entire table, with each column stored in its own byte buffer. When querying the data, only the columns that you are asking for are scanned in memory. I'm not sure what mechanism spark is using to report the amount of data read. If you want to read only the data that
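A minimal sketch of the behaviour Michael describes (the Parquet path, table and column names are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("columnar-cache"))
    val sqlContext = new SQLContext(sc)

    sqlContext.parquetFile("/data/users.parquet").registerTempTable("users")

    // Builds the in-memory columnar representation of the whole table
    sqlContext.cacheTable("users")
    sqlContext.sql("SELECT COUNT(*) FROM users").collect()   // materialize the cache

    // Subsequent queries scan only the byte buffers of the referenced columns
    sqlContext.sql("SELECT age FROM users").collect()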

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Shivaram Venkataraman
Ah I see - so it's more like 're-used stages', which is not necessarily a bug in the program or something like that. Thanks for the pointer to the comment. Thanks Shivaram On Wed, Jan 7, 2015 at 2:00 PM, Mark Hamstra m...@clearstorydata.com wrote: That's what you want to see. The computation of

Re: Spark with Cassandra - Shuffle opening to many files

2015-01-07 Thread Cody Koeninger
General ideas regarding too many open files: Make sure ulimit is actually being set, especially if you're on mesos (because of https://issues.apache.org/jira/browse/MESOS-123 ) Find the pid of the executor process, and cat /proc/pid/limits set spark.shuffle.consolidateFiles = true try
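A minimal sketch of where the consolidateFiles setting goes (the other values shown are only illustrative assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-pipeline")
      // Reuse shuffle output files across map tasks on the same core instead of
      // creating a separate file per map task and reducer
      .set("spark.shuffle.consolidateFiles", "true")
      // Fewer, larger partitions also means fewer simultaneously open shuffle files
      .set("spark.default.parallelism", "200")

    val sc = new SparkContext(conf)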

ScalaReflectionException when using saveAsParquetFile in sbt

2015-01-07 Thread figpope
I'm on Spark 1.2.0, with Scala 2.11.2, and SBT 0.13.7. When running: case class Test(message: String) val sc = new SparkContext("local", "shell") val sqlContext = new SQLContext(sc) import sqlContext._ val testing = sc.parallelize(List(Test("this"), Test("is"), Test("a"), Test("test")))

Spark Streaming with Listening Server Socket

2015-01-07 Thread gangli72
I'm new to Spark Streaming. From the programming guide I saw there is this JavaStreamingContext.socketTextStream() API that connects to a server and grabs the content to process. My requirement is slightly different: I already have a listening server that receives (rather than goes out to grab) messages from

Re: Spark with Cassandra - Shuffle opening to many files

2015-01-07 Thread Ankur Srivastava
Thank you Cody!! I am going to try with the two settings you have mentioned. We are currently running with Spark standalone cluster manager. Thanks Ankur On Wed, Jan 7, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org wrote: General ideas regarding too many open files: Make sure ulimit

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Mark Hamstra
That's what you want to see. The computation of a stage is skipped if the results for that stage are still available from the evaluation of a prior job run: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L163 On Wed, Jan 7, 2015

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-07 Thread 曹雪林
Thanks Michael. 2015-01-08 6:04 GMT+08:00 Michael Armbrust mich...@databricks.com: The cache command caches the entire table, with each column stored in its own byte buffer. When querying the data, only the columns that you are asking for are scanned in memory. I'm not sure what mechanism

When will spark support push style shuffle?

2015-01-07 Thread 曹雪林
Hi, I've heard a lot of complaints about Spark's pull-style shuffle. Is there any plan to support push-style shuffle in the near future? Currently, the shuffle phase must be completed before the next stage starts, while in Impala, it is said, the shuffled data is streamed to the next

Spark on teradata?

2015-01-07 Thread gen tang
Hi, I have a stupid question: is it possible to use Spark on a Teradata data warehouse? I read some news on the internet which says yes. However, I didn't find any examples about this. Thanks in advance. Cheers Gen

RE: FW: No APPLICATION_COMPLETE file created in history server log location upon pyspark job success

2015-01-07 Thread michael.england
Thanks Andrew, simple fix ☺. From: Andrew Ash [mailto:and...@andrewash.com] Sent: 07 January 2015 15:26 To: England, Michael (IT/UK) Cc: user Subject: Re: FW: No APPLICATION_COMPLETE file created in history server log location upon pyspark job success Hi Michael, I think you need to

Re: Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
I asked this question too soon. I am caching off a bunch of RDDs in a TrieMap so that our framework can wire them together and the locking was not completely correct- therefore it was creating multiple new RDDs at times instead of using cached versions- which were creating completely separate

Spark History Server can't read event logs

2015-01-07 Thread michael.england
Hi, When I run jobs and save the event logs, they are saved with the permissions of the unix user and group that ran the spark job. The history server is run as a service account and therefore can’t read the files: Extract from the History server logs: 2015-01-07 15:37:24,3021 ERROR Client

Re: [MLLib] storageLevel in ALS

2015-01-07 Thread Fernando O.
I guess I can, but it would be nicer if that were made a configuration. I can create the issue, test and PR if you guys think it's appropriate. On Wed, Jan 7, 2015 at 1:41 PM, Sean Owen so...@cloudera.com wrote: Ah, Fernando means the usersOut / productsOut RDDs, not the intermediate links RDDs.

streaming application throws IOException due to Log directory already exists during checkpoint recovery

2015-01-07 Thread Max Xu
Hi All, I run a Spark streaming application (Spark 1.2.0) on YARN (Hadoop 2.5.2) with Spark event log enabled. I set the checkpoint dir in the streaming context and run the app. It started in YARN with application id 'app_id_1' and created the Spark event log dir

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Laeeq Ahmed
Hi, It worked out like this: val topCounts = sortedCounts.transform(rdd => rdd.zipWithIndex().filter(x => x._2 <= 10)) Regards, Laeeq On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed laeeqsp...@yahoo.com.INVALID wrote: Hi Yana, I also think that val top10 = your_stream.mapPartitions(rdd =
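Putting the pieces of this thread together, a rough sketch of the working approach, wrapped in a helper so it compiles standalone (the HDFS prefix is a placeholder, and counts is assumed to be the (value, count) DStream produced by countByValueAndWindow in the earlier message):

    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.dstream.DStream

    def saveTop10(counts: DStream[(Long, Long)]): Unit = {
      // Sort each window's RDD by count, descending
      val sortedCounts = counts.map { case (value, count) => (count, value) }
                               .transform(rdd => rdd.sortByKey(ascending = false))

      // zipWithIndex the sorted RDD and keep only its first 10 rows
      val topCounts = sortedCounts.transform(rdd => rdd.zipWithIndex().filter(_._2 < 10))

      // One output directory per window interval
      topCounts.map { case ((count, value), _) => (value, count) }
               .saveAsTextFiles("hdfs:///user/laeeq/top10")
    }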

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Akhil Das
Oh yeah. In that case you can simply repartition it into 1 and do mapPartitions. val top10 = mystream.repartition(1).mapPartitions(rdd => rdd.take(10)) On 7 Jan 2015 21:08, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, I applied it as follows: eegStreams(a) = KafkaUtils.createStream(ssc,

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Laeeq Ahmed
Hi Yana, I also think that val top10 = your_stream.mapPartitions(rdd => rdd.take(10)) will give the top 10 from each partition. I will try your code. Regards, Laeeq On Wednesday, January 7, 2015 5:19 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: My understanding is that val top10 =

Re: [MLLib] storageLevel in ALS

2015-01-07 Thread Sean Owen
Ah, Fernando means the usersOut / productsOut RDDs, not the intermediate links RDDs. Can you unpersist() them, and persist() again at the desired level? The downside is that this might mean recomputing and repersisting the RDDs. On Wed, Jan 7, 2015 at 5:11 AM, Xiangrui Meng men...@gmail.com
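A rough sketch of that suggestion, wrapped in a helper so it compiles standalone (the rank, iteration count, lambda and target storage level are only example values):

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def trainWithCustomStorage(ratings: RDD[Rating]): MatrixFactorizationModel = {
      val model = ALS.train(ratings, 10, 10, 0.01)

      // Drop the MEMORY_AND_DISK copies created inside ALS, then re-persist at the desired level.
      // Note: unpersisting may force the feature RDDs to be recomputed when they are next used.
      model.userFeatures.unpersist()
      model.productFeatures.unpersist()
      model.userFeatures.persist(StorageLevel.MEMORY_AND_DISK_SER)
      model.productFeatures.persist(StorageLevel.MEMORY_AND_DISK_SER)
      model
    }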

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Gerard Maas
AFAIK, there are two levels of parallelism related to the Spark Kafka consumer. At the JVM level: for each receiver, one can specify the number of threads for a given topic, provided as a map [topic -> nthreads]. This will effectively start n JVM threads consuming partitions of that Kafka topic. At

Spark with Hive cluster dependencies

2015-01-07 Thread jamborta
Hi all, We have been building a system where we heavily rely on Hive queries executed through Spark to load and manipulate data, running on CDH and YARN. I have been trying to explore lighter setups where we would not have to maintain a Hadoop cluster, just run the system on Spark only. Is it

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Antony Mayi
thanks, I found the issue, I was including /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar in the classpath - this was breaking it. Now using a custom jar with just the Python converters and all works like a charm. Thanks, Antony. On Wednesday, 7 January 2015,

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread Sean Owen
O'Reilly + Databricks does certification: http://www.oreilly.com/data/sparkcert Databricks does training: http://databricks.com/spark-training Cloudera does too: http://www.cloudera.com/content/cloudera/en/training/courses/spark-training.html That said, I am not sure you need a certificate to

Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-07 Thread Marcelo Vanzin
This particular case shouldn't cause problems since both of those libraries are java-only (the scala version appended there is just for helping the build scripts). But it does look weird, so it would be nice to fix it. On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com

Re: calculating the mean of SparseVector RDD

2015-01-07 Thread Xiangrui Meng
There is some serialization overhead. You can try https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107 . -Xiangrui On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote: I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector.
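In Scala the colStats route looks roughly like this (the sparse vectors are made-up sample data):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val sc = new SparkContext(new SparkConf().setAppName("sparse-mean").setMaster("local[2]"))

    // A small RDD of sparse vectors, all of dimension 5
    val vectors = sc.parallelize(Seq(
      Vectors.sparse(5, Seq((0, 1.0), (3, 4.0))),
      Vectors.sparse(5, Seq((1, 2.0), (3, 6.0))),
      Vectors.sparse(5, Seq((0, 3.0), (4, 5.0)))
    ))

    // colStats computes column-wise summary statistics in a single pass
    val summary = Statistics.colStats(vectors)
    println(summary.mean)   // a dense vector of per-column means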

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread gen tang
Hi, I am sorry to bother you, but I couldn't find any information about online test of spark certification managed through Kryterion. Could you please give me the link about it? Thanks a lot in advance. Cheers Gen On Wed, Jan 7, 2015 at 6:18 PM, Paco Nathan cet...@gmail.com wrote: Hi

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Sven Krasser
Could you try it on AWS using EMR? That'd give you an exact replica of the environment that causes the error. Sent from my iPhone On Jan 7, 2015, at 10:53 AM, Davies Liu dav...@databricks.com wrote: Hey Sven, I tried with all of your configurations, 2 node with 2 executors each, but in

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread Paco Nathan
For online, use the http://www.oreilly.com/go/sparkcert link to sign up via O'Reilly. They will send details -- the announcement is being prepared. On Wed, Jan 7, 2015 at 10:56 AM, gen tang gen.tan...@gmail.com wrote: Hi, I am sorry to bother you, but I couldn't find any information about

Re: Spark History Server can't read event logs

2015-01-07 Thread Marcelo Vanzin
The Spark code generates the log directory with 770 permissions. On top of that you need to make sure of two things: - all directories up to /apps/spark/historyserver/logs/ are readable by the user running the history server - the user running the history server belongs to the group that owns

Re: Shuffle Problems in 1.2.0

2015-01-07 Thread Davies Liu
Hey Sven, I tried with all of your configurations, 2 node with 2 executors each, but in standalone mode, it worked fine. Could you try to narrow down the possible change of configurations? Davies On Tue, Jan 6, 2015 at 8:03 PM, Sven Krasser kras...@gmail.com wrote: Hey Davies, Here are some

Re: spark 1.2 defaults to MR1 class when calling newAPIHadoopRDD

2015-01-07 Thread Antony Mayi
this is the official Cloudera-compiled stack CDH 5.3.0 - nothing has been done by me and I presume they are pretty good at building it, so I still suspect it now gets the classpath resolved in a different way? thx, Antony. On Wednesday, 7 January 2015, 18:55, Sean Owen so...@cloudera.com wrote:

Re: FW: No APPLICATION_COMPLETE file created in history server log location upon pyspark job success

2015-01-07 Thread Andrew Ash
Hi Michael, I think you need to explicitly call sc.stop() on the spark context for it to close down properly (this doesn't happen automatically). See https://issues.apache.org/jira/browse/SPARK-2972 for more details Andrew On Wed, Jan 7, 2015 at 3:38 AM, michael.engl...@nomura.com wrote:
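A minimal sketch of that fix; wrapping the work in try/finally guarantees the stop (and hence the event-log completion marker) even if the job throws:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("history-friendly-job"))
    try {
      println(sc.parallelize(1 to 1000).map(_ * 2).count())
    } finally {
      // Flushes the event log so the history server sees the application as complete
      sc.stop()
    }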

Re: Spark Trainings/ Professional certifications

2015-01-07 Thread Paco Nathan
Hi Saurabh, In your area, Big Data Partnership provides Spark training: http://www.bigdatapartnership.com/ As Sean mentioned, there is a certification program via a partnership between O'Reilly Media and Databricks http://www.oreilly.com/go/sparkcert That is offered in two ways, in-person at

Join gets stuck in the last stage step

2015-01-07 Thread paja
Hello, I have a problem with a join of two tables via Spark - I have tried to do it via Spark SQL and the API but no progress so far. I have basically two tables: ACCOUNTS - 16 million records, and TRANSACTIONS - 2.5 billion records. When I try to join the tables (please see code) the job gets stuck in the last

calculating the mean of SparseVector RDD

2015-01-07 Thread rok
I have an RDD of SparseVectors and I'd like to calculate the mean, returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0): def aggregate_partition_values(vec1, vec2): vec1[vec2.indices] += vec2.values return vec1 def