Re: DataFrame Column Alias problem

2015-05-22 Thread SLiZn Liu
However this returns a single column of c, without showing the original col1. On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha sriharsha@gmail.com wrote: df.groupBy($"col1").agg(count($"col1").as("c")).show On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu sliznmail...@gmail.com wrote: Hi Spark

Re: Issues with constants in Spark HiveQL queries

2015-05-22 Thread Skanda
Hi All, I'm facing the same problem with Spark 1.3.0 from cloudera cdh 5.4.x. Any luck solving the issue? Exception: Exception in thread main org.apache.spark.sql.AnalysisException: Unsupported language features in query: select * from everest_marts_test.hive_ql_test where

Re: LDA prediction on new document

2015-05-22 Thread Ken Geis
Dani, this appears to be addressed in SPARK-5567, scheduled for Spark 1.5.0. Ken On May 21, 2015, at 11:12 PM, user-digest-h...@spark.apache.org wrote: From: Dani Qiu zongmin@gmail.com Subject: LDA prediction on new document Date: May 21, 2015 at 8:48:40 PM PDT To:

Re: DataFrame Column Alias problem

2015-05-22 Thread SLiZn Liu
Despite the odd usage, it does the trick, thanks Reynold! On Fri, May 22, 2015 at 2:47 PM Reynold Xin r...@databricks.com wrote: In 1.4 it actually shows col1 by default. In 1.3, you can add col1 to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21,

Re: Spark Memory management

2015-05-22 Thread Akhil Das
You can look at the logic for offloading data from memory in the ensureFreeSpace https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L416 call, and in dropFromMemory

Spark Memory management

2015-05-22 Thread swaranga
Experts, This is an academic question. Since Spark runs on the JVM, how is it able to do things like offloading RDDs from memory to disk when the data cannot fit into memory? How are the calculations performed? Does it use the methods available in the java.lang.Runtime class to get free/available
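
A minimal sketch of the idea (not Spark's actual code; Spark's MemoryStore keeps its own per-block byte accounting rather than relying on Runtime alone): a store that tracks how many bytes it holds and evicts old entries once a configured fraction of Runtime.getRuntime.maxMemory would be exceeded. The 0.6 fraction is a hypothetical placeholder.

    import scala.collection.mutable

    object ToyBlockStore {
      // Hypothetical budget: 60% of the maximum JVM heap.
      private val maxBytes = (Runtime.getRuntime.maxMemory() * 0.6).toLong
      private val blocks = mutable.LinkedHashMap[String, Array[Byte]]()
      private var usedBytes = 0L

      def put(id: String, data: Array[Byte]): Unit = synchronized {
        // Evict oldest-inserted blocks until the new block fits in the budget.
        while (usedBytes + data.length > maxBytes && blocks.nonEmpty) {
          val (oldId, oldData) = blocks.head
          blocks -= oldId
          usedBytes -= oldData.length
          // A real store would spill the evicted block to disk here.
        }
        blocks(id) = data
        usedBytes += data.length
      }
    }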

Re: Official Docker container for Spark

2015-05-22 Thread ??????
The Spark source has a Dockerfile; you can use this. -- Original -- From: tridib; tridib.sama...@live.com; Date: Fri, May 22, 2015 03:25 AM To: user; user@spark.apache.org; Subject: Official Docker container for Spark Hi, I am using spark 1.2.0. Can you

Re: [pyspark] Starting workers in a virtualenv

2015-05-22 Thread Karlson
That works, thank you! On 2015-05-22 03:15, Davies Liu wrote: Could you try specifying PYSPARK_PYTHON as the path of the python in your virtualenv, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson ksonsp...@siberie.de wrote:

Spark Streaming and Drools

2015-05-22 Thread Antonio Giambanco
Hi All, I'm deploying an architecture that uses Flume for sending log information to a sink. Spark Streaming reads from this sink (pull strategy) and processes all this information; during this process I would like to do some event processing... for example: Log appender writes information about

Re: Official Docker container for Spark

2015-05-22 Thread Ritesh Kumar Singh
Use this: sequenceiq/docker Here's a link to their github repo: docker-spark https://github.com/sequenceiq/docker-spark They have repos for other big data tools too which are again really nice. It's being maintained properly by their devs and

Re: rdd.sample() methods very slow

2015-05-22 Thread Reynold Xin
You can do something like this: val myRdd = ... val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1) // this samples 10% of the partitions rddSampledByPartition.mapPartitions { iter => iter.take(10) } // take the first 10 elements out of each partition

Re: LDA prediction on new document

2015-05-22 Thread Dani Qiu
Thanks, Ken, but I am planning to use Spark LDA in production; I cannot wait for a future release. At least, provide some workaround solution. PS: SPARK-5567 https://issues.apache.org/jira/browse/SPARK-5567 mentions This will require inference but should be able to use the same code,

Re: Re: How to use spark to access HBase with Security enabled

2015-05-22 Thread Ted Yu
Can you post the modified code? Thanks On May 21, 2015, at 11:11 PM, donhoff_h 165612...@qq.com wrote: Hi, Thanks very much for the reply. I have tried the SecurityUtil. I can see from the log that this statement executed successfully, but I still can not pass the

Re: Re: How to use spark to access HBase with Security enabled

2015-05-22 Thread donhoff_h
Hi, My modified code is listed below; it just adds the SecurityUtil API. I don't know which propertyKeys I should use, so I made 2 propertyKeys of my own to find the keytab and principal. object TestHBaseRead2 { def main(args: Array[String]) { val conf = new SparkConf() val sc = new

Re: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Gautam Bajaj
This is just a friendly ping, just to remind you of my query. Also, is there a possible explanation/example on the usage of AsyncRDDActions in Java? On Thu, May 21, 2015 at 7:18 PM, Gautam Bajaj gautam1...@gmail.com wrote: I am receiving data at UDP port 8060 and doing processing on it using
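
Not a Java answer, but a hedged Scala sketch of what the AsyncRDDActions calls look like: foreachPartitionAsync returns a FutureAction, so the driver thread is not blocked while the write runs. The local master, the sample data and the println stand-in for the database write are assumptions; the Java API is reported to expose similar *Async methods, but check the JavaRDD docs for your Spark version.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    val sc = new SparkContext(new SparkConf().setAppName("async-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000)

    // foreachPartitionAsync returns a FutureAction; the driver keeps going
    // while the partitions are written out.
    val future = rdd.foreachPartitionAsync { iter =>
      iter.foreach(x => println(x)) // replace println with the real database write
    }

    future.onComplete {
      case Success(_) => println("async write finished")
      case Failure(e) => println(s"async write failed: $e")
    }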

Re: How to use spark to access HBase with Security enabled

2015-05-22 Thread donhoff_h
Hi, Thanks very much for the reply. I have tried the SecurityUtil. I can see from the log that this statement executed successfully, but I still can not pass the authentication of HBase. And with more experiments, I found a new interesting scenario. If I run the program with yarn-client mode, the

Re: DataFrame Column Alias problem

2015-05-22 Thread Reynold Xin
In 1.4 it actually shows col1 by default. In 1.3, you can add col1 to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu sliznmail...@gmail.com wrote: However this returns a single column of c, without showing the original
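
Put together as a small sketch (Spark 1.3-style DataFrame API; the sample data and the SQLContext named sqlContext are assumptions):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._   // assumes an existing SQLContext named sqlContext

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("col1", "col2")

    // In 1.3, list the grouping column again inside agg() to keep it in the
    // output; in 1.4 groupBy keeps the grouping column automatically.
    df.groupBy($"col1")
      .agg($"col1", count($"col1").as("c"))
      .show()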

Re: Spark Memory management

2015-05-22 Thread ??????
In the Spark source, see the class org.apache.spark.deploy.worker.WorkerArguments: def inferDefaultCores(): Int = { Runtime.getRuntime.availableProcessors() } def inferDefaultMemory(): Int = { val ibmVendor = System.getProperty("java.vendor").contains("IBM") var totalMb = 0 try { val bean =

MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread SparknewUser
I am new to MLlib and to Spark (I use Scala). I'm trying to understand how LogisticRegressionWithLBFGS and LogisticRegressionWithSGD work. I usually use R to do logistic regressions but now I do it on Spark to be able to analyze Big Data. The model only returns weights and intercept. My problem

Re: HiveContext fails when querying large external Parquet tables

2015-05-22 Thread Andrew Otto
What is also strange is that this seems to work on external JSON data, but not Parquet. I’ll try to do more verification of that next week. On May 22, 2015, at 16:24, yana yana.kadiy...@gmail.com wrote: There is an open Jira on Spark not pushing predicates to metastore. I have a large

RE: HiveContext fails when querying large external Parquet tables

2015-05-22 Thread yana
There is an open Jira on Spark not pushing predicates to the metastore. I have a large dataset with many partitions but doing anything with it is very slow... But I am surprised Spark 1.2 worked for you: it has this problem... -- Original message -- From: Andrew Otto

Re: Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Tathagata Das
Can you show us the rest of the program? When are you starting, or stopping the context? Is the exception occurring right after start or stop? What about the log4j logs, what do they say? On Fri, May 22, 2015 at 7:12 AM, Cody Koeninger c...@koeninger.org wrote: I just verified that the following

HiveContext fails when querying large external Parquet tables

2015-05-22 Thread Andrew Otto
Hi all, (This email was easier to write in markdown, so I’ve created a gist with its contents here: https://gist.github.com/ottomata/f91ea76cece97444e269 https://gist.github.com/ottomata/f91ea76cece97444e269. I’ll paste the markdown content in the email body here too.) --- We’ve recently

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
If messages are consumed faster than ALL the data for a micro batch (i.e. the next RDD produced for your stream) can be processed, the following happens – let's say that e.g. your micro batch time is 3 sec: 1. Based on your message streaming and consumption rate, you

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
… and measure 4 is to implement a custom Feedback Loop, e.g. to monitor the amount of free RAM and the number of queued jobs, and automatically decrease the message consumption rate of the Receiver until the number of clogged RDDs and Jobs subsides (again, here you artificially decrease your
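
A simpler, static knob in the same spirit is the built-in receiver rate cap; a sketch of setting it on the SparkConf (the application name, batch interval and the 1000 records/sec value are placeholders to be tuned per workload):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      // Cap each receiver at N records per second so batches cannot pile up
      // faster than they are processed.
      .set("spark.streaming.receiver.maxRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(3))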

spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Todd Nist
I'm using the spark-cassandra-connector from DataStax in a spark streaming job launched from my own driver. It is connecting to a standalone cluster on my local box which has two workers running. This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT. I have added the following entry to

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread DB Tsai
Great to see the result comparable to R in the new ML implementation. Since the majority of users will still use the old mllib api, we plan to call the ML implementation from MLlib to handle the intercept correctly with regularization. JIRA is created. https://issues.apache.org/jira/browse/SPARK-7780

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread DB Tsai
In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML pipeline framework. Model selection can be achieved through a high lambda, resulting in lots of zeros in the coefficients. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On
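
A hedged sketch of what that looks like with the 1.4 ML pipeline API; the training DataFrame (with "label"/"features" columns) and the regularization values are assumptions, and zeroed coefficients correspond to dropped variables:

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setElasticNetParam(1.0)  // 1.0 = pure L1 (lasso); 0.0 would be pure L2
      .setRegParam(0.1)         // placeholder regularization strength
      .setMaxIter(100)

    val model = lr.fit(training)   // `training`: assumed DataFrame of (label, features)
    println(model.weights)         // zero entries = variables effectively dropped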

spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-22 Thread Wang, Ningjun (LNG-NPV)
I used spark standalone cluster on Windows 2008. I kept on getting the following error when trying to save an RDD to a windows shared folder rdd.saveAsObjectFile(file:///T:/lab4-win02/IndexRoot01/tobacco-07/myrdd.obj) 15/05/22 16:49:05 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID

Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data from Kinesis at 15 second intervals using two streams (i.e. receivers). The job simply grabs the latest batch and pushes it to MongoDB. I believe that the problem is that all tasks are

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
I guess each receiver occupies an executor. So there was only one executor available for processing the job. On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data

Re: spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-22 Thread Ted Yu
The stack trace is related to hdfs. Can you tell us which hadoop release you are using ? Is this a secure cluster ? Thanks On Fri, May 22, 2015 at 1:55 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I used spark standalone cluster on Windows 2008. I kept on getting the

SparkSQL failing while writing into S3 for 'insert into table'

2015-05-22 Thread ogoh
Hello, I am using Spark 1.3 with Hive 0.13.1 in AWS. From Spark SQL, when running a Hive query to export the Hive query result into AWS S3, it failed with the following message: == org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:

Dynamic Allocation with Spark Streaming

2015-05-22 Thread Saiph Kappa
Hi, 1. Dynamic allocation is currently only supported with YARN, correct? 2. In spark streaming, is it possible to change the number of executors while an application is running? If so, can the allocation be controlled by the application, instead of using any already defined automatic policy?

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Saiph Kappa
Sorry, but I can't see in TD's comments how to allocate executors on demand. It seems to me that he's talking about resources within an executor, mapping shards to cores. I want to be able to decommission executors/workers/machines. On Sat, May 23, 2015 at 3:31 AM, Ted Yu yuzhih...@gmail.com

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Saiph Kappa
Or should I shutdown the streaming context gracefully and then start it again with a different number of executors? On Sat, May 23, 2015 at 4:00 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Sorry, but I can't see on TD's comments how to allocate executors on demand. It seems to me that he's

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Ted Yu
For #1, the answer is yes. For #2, See TD's comments on SPARK-7661 Cheers On Fri, May 22, 2015 at 6:58 PM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, 1. Dynamic allocation is currently only supported with YARN, correct? 2. In spark streaming, it is possible to change the number of
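
For reference, a sketch of the usual dynamic-allocation settings on YARN; the numbers are placeholders and the same keys can equally go in spark-defaults.conf or on spark-submit:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")   // placeholder
      .set("spark.dynamicAllocation.maxExecutors", "10")  // placeholder
      // The external shuffle service is required so executors can be removed
      // without losing their shuffle files.
      .set("spark.shuffle.service.enabled", "true")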

Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Yana Kadiyska
Todd, I don't have any answers for you...other than the file is actually named spark-defaults.conf (not sure if you made a typo in the email or misnamed the file...). Do any other options from that file get read? I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3

SparkSQL query plan to Stage wise breakdown

2015-05-22 Thread Pramod Biligiri
Hi, Is there an easy way to see how a SparkSQL query plan maps to different stages of the generated Spark job? The WebUI is entirely in terms of RDD stages and I'm having a hard time mapping it back to my query. Pramod

Re: Dynamic Allocation with Spark Streaming

2015-05-22 Thread Ted Yu
That should do. Cheers On Fri, May 22, 2015 at 8:28 PM, Saiph Kappa saiph.ka...@gmail.com wrote: Or should I shutdown the streaming context gracefully and then start it again with a different number of executors? On Sat, May 23, 2015 at 4:00 AM, Saiph Kappa saiph.ka...@gmail.com wrote:

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread tyronecai
Maybe because of snappy-java, https://issues.apache.org/jira/browse/SPARK-5081 On May 23, 2015, at 1:23 AM, Josh Rosen rosenvi...@gmail.com wrote: I don't think that 0.9.3 has been released, so I'm assuming that you're running on branch-0.9. There's been over 4000 commits between 0.9.3 and

Re: Bigints in pyspark

2015-05-22 Thread Davies Liu
Could you show the schema and confirm that the columns are LongType? df.printSchema() On Mon, Apr 27, 2015 at 5:44 AM, jamborta jambo...@gmail.com wrote: hi all, I have just come across a problem where I have a table that has a few bigint columns; it seems if I read that table into a dataframe

Re: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Tathagata Das
Something does not make sense. Receivers (currently) do not get blocked (unless a rate limit has been set) due to processing load. The receiver will continue to receive data and store it in memory until it is processed. So I am still not sure how the data loss is happening. Unless you are

Re: DataFrame groupBy vs RDD groupBy

2015-05-22 Thread Michael Armbrust
DataFrames have a lot more information about the data, so there is a whole class of optimizations that are possible there that we cannot do in RDDs. This is why we are focusing a lot of effort on this part of the project. In Spark 1.4 you can accomplish what you want using the new window function
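
A hedged sketch of the 1.4 window-function route for a typical "top-N per key" use of groupBy; the DataFrame `df`, its column names, the limit of 3 and the sqlContext implicits for the $ syntax are all assumptions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._   // assumed SQLContext, for the $"..." syntax

    // Rank rows within each key by descending value...
    val w = Window.partitionBy($"key").orderBy($"value".desc)

    // ...then keep the top 3 per key.
    df.withColumn("rank", rowNumber().over(w))
      .filter($"rank" <= 3)
      .show()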

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread Joseph Bradley
If you want to select specific variable combinations by hand, then you will need to modify the dataset before passing it to the ML algorithm. The DataFrame API should make that easy to do. If you want to have an ML algorithm select variables automatically, then I would recommend using L1

Application on standalone cluster never changes state to be stopped

2015-05-22 Thread Edward Sargisson
Hi, Environment: Spark standalone cluster running with a master and a worker on a small Vagrant VM. The Jetty webapp on the same node calls the spark-submit script to start the job. From the contents of the stdout I can see that it's running successfully. However, the spark-submit process never

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Evo Eftimov
A receiver occupies a CPU core; an executor is simply a JVM instance and as such it can be granted any number of cores and RAM. So check how many cores you have per executor. Sent from Samsung Mobile -- Original message -- From: Mike Trienis mike.trie...@orcsol.com

Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Rok Roskar
on the worker/container that fails, the file not found is the first error -- the output below is from the yarn log. There were some python worker crashes for another job/stage earlier (see the warning at 18:36) but I expect those to be unrelated to this file not found error.

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
OR you can run Drools in a Central Server Mode, i.e. as a common/shared service, but that would slow down your Spark Streaming job due to the remote network call which will have to be generated for every single message From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, May 22, 2015

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
I am not aware of existing examples but you can always “ask” Google. Basically, from the Spark Streaming perspective, Drools is a third-party Software Library; you would invoke it in the same way as any other third-party software library from the Tasks (maps, filters etc) within your DAG job

Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Guillermo Ortiz
Hi, I'm trying to connect to two topics of Kafka with Spark with DirectStream but I get an error. I don't know if there's any limitation to doing it, because when I just access one topic everything is right. val ssc = new StreamingContext(sparkConf, Seconds(5)) val kafkaParams =

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
You can deploy and invoke Drools as a Singleton on every Spark Worker Node / Executor / Worker JVM. You can invoke it from e.g. map, filter etc and use the result from the Rule to make a decision on how to transform/filter an event/message From: Antonio Giambanco [mailto:antogia...@gmail.com]
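
A minimal sketch of that per-executor singleton pattern; the rule itself is a hypothetical stand-in, and a real setup would build and cache a Drools session inside initEngine (omitted here):

    object RulesEngineHolder {
      // A lazy val on an object is initialized once per JVM (i.e. once per
      // executor) and is thread-safe, so all tasks in that executor share it.
      lazy val engine: String => Boolean = initEngine()

      private def initEngine(): String => Boolean = {
        // Hypothetical stand-in for loading the rule base (e.g. a Drools session).
        (event: String) => event.contains("ERROR")
      }
    }

    // From the streaming job, call the shared engine inside a transformation, e.g.:
    // stream.filter(event => RulesEngineHolder.engine(event))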

Re: Spark Streaming and Drools

2015-05-22 Thread Dibyendu Bhattacharya
Hi, Some time back I played with distributed rule processing by integrating Drools with HBase Co-Processors, invoking rules on any incoming data: https://github.com/dibbhatt/hbase-rule-engine You can get some idea how to use Drools rules if you see this RegionObserverCoprocessor ..

Re: Re: Re: How to use spark to access HBase with Security enabled

2015-05-22 Thread Ted Yu
Can you share the exception(s) you encountered? Thanks On May 22, 2015, at 12:33 AM, donhoff_h 165612...@qq.com wrote: Hi, My modified code is listed below; it just adds the SecurityUtil API. I don't know which propertyKeys I should use, so I made 2 propertyKeys of my own to find the

Re: Issues with constants in Spark HiveQL queries

2015-05-22 Thread Skanda
Hi I was using the wrong version of the spark-hive jar. I downloaded the right version of the jar from the cloudera repo and it works now. Thanks, Skanda On Fri, May 22, 2015 at 2:36 PM, Skanda skanda.ganapa...@gmail.com wrote: Hi All, I'm facing the same problem with Spark 1.3.0 from

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
The only “tricky” bit would be when you want to manage/update the Rule Base in your Drools Engines already running as Singletons in Executor JVMs on Worker Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already loaded in Drools is not a problem. From: Evo Eftimov

Re: Partitioning of Dataframes

2015-05-22 Thread Silvio Fiorito
This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a Dataframe again? Maybe like this:

Parallel parameter tuning: distributed execution of MLlib algorithms

2015-05-22 Thread Hugo Ferreira
Hi, I am currently experimenting with linear regression (SGD) (Spark + MLlib, ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I do this (for now) by an exhaustive grid search of the step size and the number of iterations. Currently I am on a dual core that acts as
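
A hedged sketch of that grid with the Spark 1.2-era MLlib API; the grid values are placeholders and the training/validation split is assumed to exist already. Each individual fit is distributed over the cluster, while the loop over the grid runs on the driver:

    import org.apache.spark.SparkContext._   // implicits for mean() on RDD[Double] (Spark 1.2)
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    def gridSearch(training: RDD[LabeledPoint], validation: RDD[LabeledPoint]): Unit = {
      val stepSizes  = Seq(0.01, 0.1, 1.0)   // placeholder grid
      val iterations = Seq(50, 100, 200)

      val results = for (step <- stepSizes; iters <- iterations) yield {
        val model = LinearRegressionWithSGD.train(training, iters, step)
        val mse = validation
          .map(p => math.pow(model.predict(p.features) - p.label, 2))
          .mean()
        ((step, iters), mse)
      }
      results.sortBy(_._2).foreach(println)   // lowest validation MSE first
    }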

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Alright, that doesn't seem to have made it into the Python API yet. On 2015-05-22 15:12, Silvio Fiorito wrote: This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a Dataframe again? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to me, though, and I'm not sure if the DF will be aware of its partitioning. On 2015-05-22

LDA prediction on new document

2015-05-22 Thread Charles Earl
Dani, Folding in, I believe, refers to setting up your Gibbs sampler (or other model) with the learned word and document topic proportions as computed by Spark. You might look at https://lists.cs.princeton.edu/pipermail/topic-models/2014-May/002763.html where Jones suggests summing across

Re: Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Cody Koeninger
I just verified that the following code works on 1.3.0 : val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic1) val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topic2)
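
For what it's worth, a hedged alternative sketch: the direct API in 1.3 takes a Set of topics, so a single stream can also cover both topics if per-topic separation is not needed (ssc, kafkaParams, topic1 and topic2 are as in Cody's snippet):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One direct stream subscribed to both topics.
    val combined = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set(topic1, topic2))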

partitioning after extracting from a hive table?

2015-05-22 Thread Cesar Flores
I have a table in a Hive database partitioned by date. I notice that when I query this table using HiveContext, the created data frame has a specific number of partitions. Does this partitioning correspond to my original table partitioning in Hive? Thanks -- Cesar Flores

Re: How to use spark to access HBase with Security enabled

2015-05-22 Thread Frank Staszak
You might also enable debug in: hadoop-env.sh # Extra Java runtime options. Empty by default. export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}" and check that the principals are the same on the NameNode and DataNode. and you can

Re: Partitioning of Dataframes

2015-05-22 Thread Ted Yu
Looking at python/pyspark/sql/dataframe.py : @since(1.4) def coalesce(self, numPartitions): @since(1.3) def repartition(self, numPartitions): Would the above methods serve the purpose ? Cheers On Fri, May 22, 2015 at 6:57 AM, Karlson ksonsp...@siberie.de wrote: Alright, that

Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Shay Seng
Hi. I have a job that takes ~50min with Spark 0.9.3 and ~1.8hrs on Spark 1.3.1 on the same cluster. The only code difference between the two code bases is to fix the Seq -> Iter changes that happened in the Spark 1.x series. Are there any other changes in the defaults from spark 0.9.3 -> 1.3.1

Help reading Spark UI tea leaves..

2015-05-22 Thread Shay Seng
Hi. I have an RDD that I use repeatedly through many iterations of an algorithm. To prevent recomputation, I persist the RDD (and incidentally I also persist and checkpoint its parents) val consCostConstraintMap = consCost.join(constraintMap).map { case (cid, (costs,(mid1,_,mid2,_,_))) => {

Re: Spark Streaming and Drools

2015-05-22 Thread Antonio Giambanco
Thanks a lot Evo, do you know where I can find some examples? Have a great one A G 2015-05-22 12:00 GMT+02:00 Evo Eftimov evo.efti...@isecc.com: You can deploy and invoke Drools as a Singleton on every Spark Worker Node / Executor / Worker JVM You can invoke it from e.g. map, filter etc

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Reynold Xin
I'm not sure if it is possible to overload the map function twice, once for just KV pairs, and another for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com wrote: This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the RDD API, but

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
The (crude) proof of concept seems to work: class RDD[V](value: List[V]){ def doStuff = println("I'm doing stuff") } object RDD{ implicit def toPair[V](x:RDD[V]) = new PairRDD(List((1,2))) } class PairRDD[K,V](value: List[(K,V)]) extends RDD[(K,V)](value){ def doPairs = println("I'm using pairs") }

How to share a (spring) singleton service with Spark?

2015-05-22 Thread Tristan107
I have a small Spark launcher app able to instantiate a service via a Spring XML application context and then broadcast it in order to make it available on remote nodes. I suppose when a Spring service is instantiated, all dependencies are instantiated and injected at the same time, so

Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the RDD API, but it could be even more discoverable if made available via the API directly. I assume this was originally an omission that now needs to be kept for backwards compatibility, but would any of the repo owners be open

Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Imran Rashid
Hmm, sorry, I think that disproves my theory. Nothing else is immediately coming to mind. It's possible there is more info in the logs from the driver; couldn't hurt to send those (though I don't have high hopes of finding anything that way). Off chance this could be from too many open files or

Re: partitioning after extracting from a hive table?

2015-05-22 Thread ayan guha
I guess not. Spark partitions correspond to number of splits. On 23 May 2015 00:02, Cesar Flores ces...@gmail.com wrote: I have a table in a Hive database partitioning by date. I notice that when I query this table using HiveContext the created data frame has an specific number of partitions.

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Josh Rosen
I don't think that 0.9.3 has been released, so I'm assuming that you're running on branch-0.9. There's been over 4000 commits between 0.9.3 and 1.3.1, so I'm afraid that this question doesn't have a concise answer: https://github.com/apache/spark/compare/branch-0.9...v1.3.1 To narrow down the

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread Xin Liu
Thank you guys for the prompt help. I ended up building spark master and verified what DB has suggested. val lr = (new MlLogisticRegression) .setFitIntercept(true) .setMaxIter(35) val model = lr.fit(sqlContext.createDataFrame(training)) val scoreAndLabels =