change the spark version

2015-09-11 Thread Angel Angel
Respected sir, I installed two versions of Spark: 1.2.0 (Cloudera 5.3) and 1.4.0. I am running an application that needs Spark 1.4.0; the application is related to deep learning. *So how can I remove version 1.2.0 and run my application on version 1.4.0?* When I run the command spark-shell

Re: Multithreaded vs Spark Executor

2015-09-11 Thread Richard Eggert
Parallel processing is what Spark was made for. Let it do its job. Spawning your own threads independently of what Spark is doing seems like you'd just be asking for trouble. I think you can accomplish what you want by taking the cartesian product of the data element RDD and the feature list RDD a
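A rough sketch of the cartesian-product approach described above (spark-shell style; the data and the scoring function are placeholders, not from the original thread):

val elements = sc.parallelize(Seq("e1", "e2", "e3"))   // stand-in for the Kafka-derived data elements
val features = sc.parallelize(Seq("f1", "f2", "f3"))   // stand-in for the ~100-item feature list
// Every (element, feature) pair becomes ordinary parallel work for Spark -- no hand-rolled threads.
val scored = elements.cartesian(features)
  .map { case (element, feature) => (element, feature, element.length + feature.length) }
scored.collect().foreach(println)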

Re: Implement "LIKE" in SparkSQL

2015-09-11 Thread Richard Eggert
concat and locate are available as of version 1.5.0, according to the Scaladocs. For earlier versions of Spark, and for the operations that are still not supported, it's pretty straightforward to define your own UserDefinedFunctions in either Scala or Java (I don't know about other languages). On
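A minimal sketch of the UDF route for pre-1.5 versions, assuming a registered table t with string columns a and b (names are illustrative):

// Register a Scala function as a SQL UDF and use it where LIKE '%'||b||'%' would go in Oracle.
sqlContext.udf.register("containsStr", (a: String, b: String) => a != null && b != null && a.contains(b))
val matches = sqlContext.sql("SELECT * FROM t WHERE containsStr(t.a, t.b)")
matches.show()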

Multithreaded vs Spark Executor

2015-09-11 Thread Rachana Srivastava
Hello all, We are getting a stream of input data from a Kafka queue using the Spark Streaming API. For each data element we want to run parallel threads to process a set of feature lists (nearly 100 features or more). Since the feature list creations are independent of each other we would like to execu

Implement "LIKE" in SparkSQL

2015-09-11 Thread liam
Hi, Imagine this: the value of one column is a substring of another column. When using Oracle, I have many ways to do the query, like the following statement, but how do I do it in SparkSQL since there is no concat(), instr(), locate()...? select * from table t where t.a like '%'||t.b||'%'; Thanks.

Re: Multilabel classification support

2015-09-11 Thread Alexis Gillain
Do you mean by running a model on every label? That's another solution of course. If you mean LogisticRegression natively "supports" multilabel, can you provide me with some references? From what I see in the code it uses LabeledPoint, which has only one label. 2015-09-11 21:54 GMT+08:00 Yanbo Liang :

Re: countApproxDistinctByKey in python

2015-09-11 Thread Ted Yu
It has not been ported yet. On Fri, Sep 11, 2015 at 4:13 PM, LucaMartinetti wrote: > Hi, > > I am trying to use countApproxDistinctByKey in pyspark but cannot find it. > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L417 > > Am I

Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Slight update: The following code with "spark context" works with wildcard file paths in hard-coded strings, but it won't work with a value parsed out of the program arguments as above: val sc = new SparkContext(sparkConf) val zipFileTextRDD = sc.textFile("/data/raw/logs/2015-09-01/home/logs/acc

countApproxDistinctByKey in python

2015-09-11 Thread LucaMartinetti
Hi, I am trying to use countApproxDistinctByKey in pyspark but cannot find it. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L417 Am I missing something, or has it not been ported / wrapped yet? Thanks
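For reference, the Scala API in question (a spark-shell sketch with made-up data; PySpark has no wrapper for it at this point):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 2), ("b", 7)))
// relativeSD controls the accuracy of the HyperLogLog estimate (smaller = more accurate, more memory).
val approxDistinct = pairs.countApproxDistinctByKey(relativeSD = 0.05)
approxDistinct.collect().foreach(println)   // roughly (a,2) and (b,1), within the error bound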

SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-11 Thread Varadhan, Jawahar
Hi all, I have coded a custom receiver which receives Kafka messages. These Kafka messages have FTP server credentials in them. The receiver then opens the message and uses the FTP credentials in it to connect to the FTP server. It then streams this huge text file (3.3G). Finally this stre

UserDefinedTypes

2015-09-11 Thread Richard Eggert
Greetings, I have recently started using Spark SQL and ran up against two rather odd limitations related to UserDefinedTypes. The first is that there appears to be no way to register a UserDefinedType other than by adding the @SQLUserDefinedType annotation to the class being mapped. This makes it

Re: java.util.NoSuchElementException: key not found

2015-09-11 Thread Yin Huai
Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422, it has been fixed in branch 1.5. 1.5.1 release will have it. On Fri, Sep 11, 2015 at 3:35 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi all , > After upgrade spark to 1.5 , Streaming throw > java.util.No

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
Thanks, I'm surprised to see there is so much difference (4x); there could be something wrong in Spark (some contention between tasks). On Fri, Sep 11, 2015 at 11:47 AM, Jesse F Chen wrote: > > @Davies...good question.. > > > Just be curious how the difference would be if you use 20 executors >

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-11 Thread Tim Chen
Yes you can create an issue, or actually contribute a patch to update it :) Sorry the docs are a bit light, I'm going to make them more complete along the way. Tim On Fri, Sep 11, 2015 at 11:11 AM, Tom Waterhouse (tomwater) < tomwa...@cisco.com> wrote: > Tim, > > Thank you for the explanation. Y

which install package type for cassandra use

2015-09-11 Thread beakesland
Hello, Which install package type is suggested for adding Spark nodes to an existing Cassandra cluster? I will be using it to deal with data already stored in Cassandra via the connector. I am not currently running any Hadoop/CDH. Thank you. Phil

Error - Calling a package (com.databricks:spark-csv_2.10:1.0.3) with spark-submit

2015-09-11 Thread Subhajit Purkayastha
I am on Spark 1.3.1. When I do the following with spark-shell, it works: spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 Then I can create a DF using the spark-csv package: import sqlContext.implicits._ import org.apache.spark.sql._ // Return the dataset specified by d
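A hedged sketch of the DataFrame creation step on Spark 1.3.x with that package (the file path and options are made up for illustration):

// spark-shell was started with: --packages com.databricks:spark-csv_2.10:1.0.3
val df = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "data.csv", "header" -> "true"))
df.printSchema()
df.show()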

Re: New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Cody Koeninger
No, in general you can't make new RDDs in code running on the executors. It looks like your properties file is a constant, why not process it at the beginning of the job and broadcast the result? On Fri, Sep 11, 2015 at 2:09 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > H
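A sketch of the suggestion above: parse the properties file once on the driver and broadcast the parsed result, rather than building RDDs inside executor code (the path and key are placeholders):

import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
props.load(new FileInputStream("/path/to/app.properties"))   // hypothetical path
val propsBc = sc.broadcast(props)                            // java.util.Properties is Serializable

// Inside any map/foreachRDD closure running on executors:
// val value = propsBc.value.getProperty("some.key")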

Re: Training the MultilayerPerceptronClassifier

2015-09-11 Thread Feynman Liang
Rory, I just sent a PR (https://github.com/avulanov/ann-benchmark/pull/1) to bring that benchmark up to date. Hope it helps. On Fri, Sep 11, 2015 at 6:39 AM, Rory Waite wrote: > Hi, > > I’ve been trying to train the new MultilayerPerceptronClassifier in spark > 1.5 for the MNIST digit recogniti

updateStateByKey when the state is very large

2015-09-11 Thread Brush,Ryan
All, I've run into a usage pattern that seems like it would pop up elsewhere, so I'd like to kick around ways to solve it and perhaps land on a common and reusable approach. Here it is: Consider a use of Spark Streaming's updateStateByKey, but the state being maintained may be too large to
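For context, the basic updateStateByKey pattern under discussion is sketched below (a running count per key; keyedStream is a placeholder DStream[(String, Int)] -- the point of the thread is what to do when this state grows too large to keep this way):

val updateFunc: (Seq[Int], Option[Long]) => Option[Long] =
  (newValues, state) => Some(state.getOrElse(0L) + newValues.sum)
val runningCounts = keyedStream.updateStateByKey[Long](updateFunc)
runningCounts.print()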

New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Rachana Srivastava
Hello all, Can we create a JavaRDD while processing a stream from Kafka, for example? The following code is throwing a serialization exception; not sure if this is feasible. JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(5)); JavaPairReceiverInputDStream messages

Re: selecting columns with the same name in a join

2015-09-11 Thread Michael Armbrust
Here is what I get on branch-1.5: x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF() y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF() x.registerTempTable('x') y.registerTempTable('y') sqlContext.sql("select y.v, x.v FROM x INNER JOIN y ON x.k=y.k").colle
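If the ambiguity is a problem on an older build, aliasing the columns in the query is a simple workaround (a sketch, not necessarily the fix being described here):

sqlContext.sql("SELECT y.v AS y_v, x.v AS x_v FROM x INNER JOIN y ON x.k = y.k").collect()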

RE: MongoDB and Spark

2015-09-11 Thread Mishra, Abhishek
Hello, Don’t get me wrong here, just as per my understanding after reading your reply: are you telling me about MongoDB instances on multiple nodes? I am talking about a single MongoDB instance/server having multiple collections in it (say, multiple tables). Please help me in understanding. A

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen
@Davies...good question.. > Just be curious how the difference would be if you use 20 executors > and 20G memory for each executor.. So I tried the following combinations (GB x # executors -> query response time in secs): 20 x 20 -> 415, 10 x 40 -> 230, 5

Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Folks, Any help on this? Regards, Atul. On Fri, Sep 11, 2015 at 8:39 AM, Atul Kulkarni wrote: > Hi Raghavendra, > > Thanks for your answers, I am passing 10 executors and I am not sure if > that is the problem. It is still hung. > > Regards, > Atul. > > > On Fri, Sep 11, 2015 at 12:40 AM, Rag

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Dean Wampler
Here's a demonstration video from @noootsab himself (creator of Spark Notebook) showing live charting in Spark Notebook. It's one reason I prefer it over the other options. https://twitter.com/noootsab/status/638489244160401408 Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

SparkR connection string to Cassandra

2015-09-11 Thread Austin Trombley
Spark, Do you have a SparkR connection string example of an RJDBC connection to a Cassandra cluster? Thanks -- regards, Austin Trombley, MBA Senior Manager – Business Intelligence

Spark monitoring

2015-09-11 Thread prk77
Is there a way to fetch the current Spark cluster memory & CPU usage programmatically? I know that the default Spark master web UI has these details but I want to retrieve them through a program and store them for analysis.

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-11 Thread Tom Waterhouse (tomwater)
Tim, Thank you for the explanation. You are correct, my Mesos experience is very light, and I haven’t deployed anything via Marathon yet. What you have stated here makes sense, I will look into doing this. Adding this info to the docs would be great. Is the appropriate action to create an i

Re: Few Conceptual Questions on Spark-SQL and HiveQL

2015-09-11 Thread Narayanan K
Hi there, any replies? :) -Narayanan On Fri, Sep 11, 2015 at 1:51 AM, Narayanan K wrote: > Hi all, > > We are migrating from Hive to Spark. We used Spark-SQL CLI to run our > Hive Queries for performance testing. I am new to Spark and had few > clarifications. We have : > > > 1. Set up 10 boxes,

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Silvio Fiorito
So if you want to build your own from the ground up, then yes you could go the d3js route. Like Feynman also responded you could use something like Spark Notebook or Zeppelin to create some charts as well. It really depends on your intended audience and ultimate goal. If you just want some count

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Jo Sunad
I've found Apache Zeppelin to be a good start if you want to visualize spark data. It doesn't come with streaming visualizations, although I've seen people tweak the code so it does let you do real time visualizations with spark streaming Other tools I've heard about are python notebook and spark

Re: Model summary for linear and logistic regression.

2015-09-11 Thread Feynman Liang
Sorry! The documentation is not the greatest thing in the world, but these features are documented here On Fri, Sep 11, 2015 at 6:25 AM, Sebastian Kuepers < sebastian.kuep...@publicispixelpark.de> wrote: > Hey, > > > the 1.5.0 release

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen wrote: > > Thanks Hao! > > I tried your suggestion of setting spark.shuffle.reduceLocality.enabled=false > and my initial tests showed queries are on par between 1.5 and 1.4.1. > > Results: > > tpcds-query39b-141.out:query time: 129.106478631 sec > t

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Feynman Liang
Spark notebook does something similar, take a look at their line chart code On Fri, Sep 11, 2015 at 8:56 AM, Shashi Vishwakarma < shashi.vish...@gmail.com> wrote: > Hi > > I have go

Re: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
I ran a similar benchmark for 1.5: do a self join on a fact table with a join key that has many duplicated rows (there are N rows for the same join key); after the join, there will be N*N rows for each join key. Generating the joined row is slower in 1.5 than 1.4 (it needs to copy left and right r

RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen
Thanks Hao! I tried your suggestion of setting spark.shuffle.reduceLocality.enabled=false and my initial tests showed queries are on par between 1.5 and 1.4.1. Results: tpcds-query39b-141.out: query time: 129.106478631 sec tpcds-query39b-150-reduceLocality-false.out: query time: 128.854284
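For reference, one way to apply the setting being tested, either programmatically or via spark-submit (a sketch; the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("tpcds-query39b")
  .set("spark.shuffle.reduceLocality.enabled", "false")
val sc = new SparkContext(conf)
// equivalently: spark-submit --conf spark.shuffle.reduceLocality.enabled=false ...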

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
Having the driver write the data instead of a worker probably won't speed it up; you still need to copy all of the data to a single node. Is there something which forces you to only write from a single node? On Friday, September 11, 2015, Luca wrote: > Hi, > thanks for answering. > > With the *

Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-11 Thread Davies Liu
Did this happen immediately after you start the cluster or after ran some queries? Is this in local mode or cluster mode? On Fri, Sep 11, 2015 at 3:00 AM, Jagat Singh wrote: > Hi, > > We have queries which were running fine on 1.4.1 system. > > We are testing upgrade and even simple query like >

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Luca
Hi, thanks for answering. With the *coalesce() *transformation a single worker is in charge of writing to HDFS, but I noticed that the single write operation usually takes too much time, slowing down the whole computation (this is particularly true when 'unified' is made of several partitions). Be

Re: Cassandra row count grouped by multiple columns

2015-09-11 Thread Eric Walker
Hi Chirag, Maybe something like this? import org.apache.spark.sql._ import org.apache.spark.sql.types._ val rdd = sc.parallelize(Seq( Row("A1", "B1", "C1"), Row("A2", "B2", "C2"), Row("A3", "B3", "C2"), Row("A1", "B1", "C1") )) val schema = StructType(Seq("a", "b", "c").map(c => StructF
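A hedged completion of the idea in the snippet above (spark-shell style; the truncated schema line presumably builds string fields for a, b and c):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Seq(
  Row("A1", "B1", "C1"), Row("A2", "B2", "C2"),
  Row("A3", "B3", "C2"), Row("A1", "B1", "C1")))
val schema = StructType(Seq("a", "b", "c").map(c => StructField(c, StringType)))
val df = sqlContext.createDataFrame(rdd, schema)
// Count rows per distinct (a, b, c) combination.
df.groupBy("a", "b", "c").count().show()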

Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
A common practice to do this is to use foreachRDD with a local var to accumulate the data (you can see it in the Spark Streaming test code). That being said, I am a little curious why you want the driver to create the file specifically. On Friday, September 11, 2015, allonsy > wrote: > Hi everyo
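A rough Scala sketch of that foreachRDD pattern, assuming each batch is small enough to collect to the driver (unified stands in for the DStream from the original question; the buffer name is illustrative):

val collected = scala.collection.mutable.ArrayBuffer[String]()
unified.foreachRDD { rdd =>
  // The body of foreachRDD runs on the driver, so a driver-side buffer can be appended to directly.
  collected ++= rdd.collect().map(_.toString)
}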

I'd like to add our company to the Powered by Spark page

2015-09-11 Thread Timothy Snyder
Hello, I'm interested in adding our company to this Powered by Spark page. I've included some information below, but if you have any questions or need any additional information please let me know. Organization name: Hawk Sea

Help with collect() in Spark Streaming

2015-09-11 Thread allonsy
Hi everyone, I have a JavaPairDStream object and I'd like the driver to create a txt file (on HDFS) containing all of its elements. At the moment, I use the coalesce(1, true) method: JavaPairDStream unified = [partitioned stuff] unified.foreachRDD(new Function, Void>() {

Realtime Data Visualization Tool for Spark

2015-09-11 Thread Shashi Vishwakarma
Hi, I have got streaming data which needs to be processed and sent for visualization. I am planning to use Spark Streaming for this but am a little bit confused in choosing a visualization tool. I read somewhere that D3.js can be used, but I wanted to know which is the best tool for visualization while dealing w

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Ted Yu
Very nice suggestion, Richard. I logged SPARK-10561 referencing this discussion. On Fri, Sep 11, 2015 at 8:15 AM, Richard Hillegas wrote: > The latest Derby SQL Reference manual (version 10.11) can be found here: > https://db.apache.org/derby/docs/10.11/ref/index.html. It is, indeed, > very use

Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Hi Raghavendra, Thanks for your answers, I am passing 10 executors and I am not sure if that is the problem. It is still hung. Regards, Atul. On Fri, Sep 11, 2015 at 12:40 AM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > You can pass the number of executors via command line opti

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Peyman Mohajerian
http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSqlSupportedSyntax.html On Fri, Sep 11, 2015 at 8:15 AM, Richard Hillegas wrote: > The latest Derby SQL Reference manual (version 10.11) can be found here: > https://db.apache.org/derby/docs/10.11/ref/index.html.

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Richard Hillegas
The latest Derby SQL Reference manual (version 10.11) can be found here: https://db.apache.org/derby/docs/10.11/ref/index.html. It is, indeed, very useful to have a comprehensive reference guide. The Derby build scripts can also produce a BNF description of the grammar--but that is not part of the

A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly, is there a way to do it programmatically based on a configurable timeout value? Thanks.
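One possibility, sketched below: tag the work with a job group and cancel the group from a watchdog thread after a configurable timeout (the group name and timeout are illustrative):

sc.setJobGroup("maybe-laggard", "job that may need killing", interruptOnCancel = true)
val watchdog = new Thread(new Runnable {
  def run(): Unit = {
    Thread.sleep(60000L)                    // configurable timeout in ms
    sc.cancelJobGroup("maybe-laggard")      // cancels all jobs in the group if still running
  }
})
watchdog.setDaemon(true)
watchdog.start()
// ... then run the potentially slow action, e.g. rdd.count()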

RE: Exception Handling : Spark Streaming

2015-09-11 Thread Samya MAITI
Yeah Ted, I am trying to handle the exception from saveToCassandra(). But if I don’t place the catch below ssc.awaitTermination(), then the exception is not handled even the first time. Looks like the main thread is out of the exception block once the first exception is handled, but I'm not sure. -

Exception Handling : Spark Streaming

2015-09-11 Thread Samya
Hi Team, I am facing this issue wherein I can't figure out why the exception is handled the first time an exception is thrown in the stream processing action, but is ignored the second time. PFB my code base. object Boot extends App { //Load the configuration val config = LoadConfig.get

Re: Exception Handling : Spark Streaming

2015-09-11 Thread Ted Yu
Was your intention that exception from rdd.saveToCassandra() be caught ? In that case you can place try / catch around that call. Cheers On Fri, Sep 11, 2015 at 7:30 AM, Samya wrote: > Hi Team, > > I am facing this issue where in I can't figure out why the exception is > handled the first time
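A sketch of that suggestion, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, and stream are placeholders:

import com.datastax.spark.connector._

stream.foreachRDD { rdd =>
  try {
    rdd.saveToCassandra("my_keyspace", "my_table")
  } catch {
    case e: Exception =>
      // log and decide whether to keep processing or stop the StreamingContext
      println(s"saveToCassandra failed: ${e.getMessage}")
  }
}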

Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client API so the problem with trying to schedule Spark tasks against it is that the tasks themselves cannot be scheduled locally on nodes containing query results- which means you can only assume most results will be sent over th

Re: Is it required to remove checkpoint when submitting a code change?

2015-09-11 Thread Cody Koeninger
Yeah, it makes sense that parameters that are read only during your getOrCreate function wouldn't be re-read, since that function isn't called if a checkpoint is loaded. I would have thought changing the number of executors and other things used by spark-submit would work on checkpoint restart. Have
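For reference, the getOrCreate pattern being discussed (a sketch; the checkpoint directory and batch interval are placeholders). Everything inside createContext runs only when no checkpoint exists, which is why values read there are effectively frozen into the checkpoint:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint"   // hypothetical path
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // define the DStream graph here
  ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)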

Re: MLlib LDA implementation questions

2015-09-11 Thread Carsten Schnober
Hi, I don't have practical experience with the MLlib LDA implementation, but regarding the variations in the topic matrix: LDA makes use of stochastic processes. If you use setSeed(seed) with the same value for seed during initialization, your results should be identical though. May I ask what exac
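A sketch of pinning the seed so repeated runs produce the same topic matrix (corpus is assumed to be an RDD[(Long, Vector)] of document term-count vectors):

import org.apache.spark.mllib.clustering.LDA

val ldaModel = new LDA()
  .setK(20)
  .setMaxIterations(50)
  .setSeed(12345L)    // same seed + same input => identical topicsMatrix across runs
  .run(corpus)
println(ldaModel.topicsMatrix)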

Re: Multilabel classification support

2015-09-11 Thread Yanbo Liang
LogisticRegression in MLlib(not ML) package supports both multiclass and multilabel classification. 2015-09-11 16:21 GMT+08:00 Alexis Gillain : > You can try these packages for adaboost.mh : > > https://github.com/BaiGang/spark_multiboost (scala) > or > https://github.com/tizfa/sparkboost (java)
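For the multiclass part of that claim, the MLlib-package usage looks roughly like the sketch below (training is a placeholder RDD[LabeledPoint] with labels 0.0 through 9.0); whether this also covers the multilabel case being asked about is the open question in this thread:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)    // multinomial / multiclass logistic regression
  .run(training)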

Training the MultilayerPerceptronClassifier

2015-09-11 Thread Rory Waite
Hi, I’ve been trying to train the new MultilayerPerceptronClassifier in spark 1.5 for the MNIST digit recognition task. I’m trying to reproduce the work here: https://github.com/avulanov/ann-benchmark The API has changed since this work, so I’m not sure that I’m setting up the task correctly.

Re: MongoDB and Spark

2015-09-11 Thread Sandeep Giri
I think it should be possible by loading the collections as RDDs and then doing a union on them. Regards, Sandeep Giri, +1 347 781 4573 (US) +91-953-899-8962 (IN) www.KnowBigData.com. Phone: +1-253-397-1945 (Office)
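A sketch of the union idea: load each collection into its own RDD (with whatever MongoDB connector is in use -- not shown here) and union them before further processing. The parallelize calls below are stand-ins for the real collection loads:

val collectionA = sc.parallelize(Seq(Map("name" -> "doc1"), Map("name" -> "doc2")))
val collectionB = sc.parallelize(Seq(Map("name" -> "doc3")))
val combined = collectionA.union(collectionB)
println(combined.count())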

Fwd: MLlib LDA implementation questions

2015-09-11 Thread Marko Asplund
Hi, We're considering using Spark MLlib (v >= 1.5) LDA implementation for topic modelling. We plan to train the model using a data set of about 12 M documents and vocabulary size of 200-300 k items. Documents are relatively short, typically containing less than 10 words, but the number can range u

Model summary for linear and logistic regression.

2015-09-11 Thread Sebastian Kuepers
Hey, the 1.5.0 release notes say that there are now model summaries for logistic regression available. But I can't find them in the current documentation. Any help very much appreciated! Thanks Sebastian

Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-11 Thread Ted Yu
Have you looked at: https://issues.apache.org/jira/browse/SPARK-8013 > On Sep 11, 2015, at 4:53 AM, Petr Novak wrote: > > Does it still apply for 1.5.0? > > What actual limitation does it mean when I switch to 2.11? No JDBC > Thriftserver? No JDBC DataSource? No JdbcRDD (which is already obs

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread vivek bhaskar
Hi Ted, The link you mention does not have a complete list of supported syntax. For example, a few supported constructs are listed as "Supported Hive features", but that does not claim to be exhaustive (even if it is so, one has to filter out a great many lines from the Hive QL reference and still will not be sure i

Spark does not yet support its JDBC component for Scala 2.11.

2015-09-11 Thread Petr Novak
Does it still apply for 1.5.0? What actual limitation does it mean when I switch to 2.11? No JDBC Thriftserver? No JDBC DataSource? No JdbcRDD (which is already obsolete I believe)? Some more? What library is the blocker to upgrade JDBC component to 2.11? Is there any estimate when it could be a

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Ted Yu
You may have seen this: https://spark.apache.org/docs/latest/sql-programming-guide.html Please suggest what should be added. Cheers On Fri, Sep 11, 2015 at 3:43 AM, vivek bhaskar wrote: > Hi all, > > I am looking for a reference manual for Spark SQL some thing like many > database vendors have

Is there any Spark SQL reference manual?

2015-09-11 Thread vivek bhaskar
Hi all, I am looking for a reference manual for Spark SQL, something like what many database vendors have. I could find one for Hive QL, https://cwiki.apache.org/confluence/display/Hive/LanguageManual, but not anything specific to Spark SQL. Please suggest. A SQL reference specific to the latest release will

java.util.NoSuchElementException: key not found

2015-09-11 Thread guoqing0...@yahoo.com.hk
Hi all, After upgrading Spark to 1.5, Streaming occasionally throws java.util.NoSuchElementException: key not found. Could a problem with the data cause this error? Please help me if anyone has seen a similar problem before, thanks very much. The exception occurs when writing into the database. org.apache.spa

Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-11 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtPPuSvBu0rj2 > On Sep 11, 2015, at 3:00 AM, Jagat Singh wrote: > > Hi, > > We have queries which were running fine on 1.4.1 system. > > We are testing upgrade and even simple query like > val t1= sqlContext.sql("select count(*) from

selecting columns with the same name in a join

2015-09-11 Thread Evert Lammerts
Am I overlooking something? This doesn't seem right: x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF() y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF() x.registerTempTable('x') y.registerTempTable('y') sqlContext.sql("select y.v, x.v FROM x INNER JOIN y

Exception in Spark-sql insertIntoJDBC command

2015-09-11 Thread Baljeet Singh
Hi, I’m using Spark SQL to insert the data from a CSV into a table in SQL Server as the database. The createJDBCTable command is working fine with it. But when I try to insert more records into the same table that I created in the database using insertIntoJDBC, it throws an error message – Ex

RE: MongoDB and Spark

2015-09-11 Thread Mishra, Abhishek
Anything using Spark RDD’s ??? Abhishek From: Sandeep Giri [mailto:sand...@knowbigdata.com] Sent: Friday, September 11, 2015 3:19 PM To: Mishra, Abhishek; user@spark.apache.org; d...@spark.apache.org Subject: Re: MongoDB and Spark use map-reduce. On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek m

Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-11 Thread Jagat Singh
Hi, We have queries which were running fine on 1.4.1 system. We are testing upgrade and even simple query like val t1= sqlContext.sql("select count(*) from table") t1.show This works perfectly fine on 1.4.1 but throws OOM error in 1.5.0 Are there any changes in default memory settings from 1.

Re: MongoDB and Spark

2015-09-11 Thread Sandeep Giri
use map-reduce. On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek wrote: > Hello , > > > > Is there any way to query multiple collections from mongodb using spark > and java. And i want to create only one Configuration Object. Please help > if anyone has something regarding this. > > > > > > Thank Y

MongoDB and Spark

2015-09-11 Thread Mishra, Abhishek
Hello, Is there any way to query multiple collections from MongoDB using Spark and Java? And I want to create only one Configuration object. Please help if anyone has something regarding this. Thank You Abhishek

Few Conceptual Questions on Spark-SQL and HiveQL

2015-09-11 Thread Narayanan K
Hi all, We are migrating from Hive to Spark. We used the Spark-SQL CLI to run our Hive queries for performance testing. I am new to Spark and had a few clarifications. We have: 1. Set up 10 boxes, one master and 9 slaves in standalone mode. Each of the boxes is a launcher to our external Hadoop grid.

RE:RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread prosp4300
By the way, turning off code generation could be an option to try; sometimes code generation can introduce slowness. On 2015-09-11 15:58, Cheng, Hao wrote: Can you confirm if the query really runs in cluster mode, not local mode? Can you print the call stack of the executor when the query

Re: Multilabel classification support

2015-09-11 Thread Alexis Gillain
You can try these packages for adaboost.mh : https://github.com/BaiGang/spark_multiboost (scala) or https://github.com/tizfa/sparkboost (java) 2015-09-11 15:29 GMT+08:00 Yasemin Kaya : > Hi, > > I want to use Mllib for multilabel classification, but I find > http://spark.apache.org/docs/latest/

RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
Can you confirm if the query really runs in cluster mode, not local mode? Can you print the call stack of the executor when the query is running? BTW: spark.shuffle.reduceLocality.enabled is a configuration of Spark, not Spark SQL. From: Todd [mailto:bit1...@163.com] Sent: Friday, Sept

sparksql query hive data error

2015-09-11 Thread stark_summer
Started the hive metastore service OK. The hadoop io compression codec is lzo, configured in core-site.xml: io.compression.codecs org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,

Re: Spark based Kafka Producer

2015-09-11 Thread Raghavendra Pandey
You can pass the number of executors via the command line option --num-executors. You need more than 2 executors to make spark-streaming work. For more details on command line options, please go through http://spark.apache.org/docs/latest/running-on-yarn.html. On Fri, Sep 11, 2015 at 10:52 AM, Atul

Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Todd
I added the following two options: spark.sql.planner.sortMergeJoin=false spark.shuffle.reduceLocality.enabled=false But it still performs the same as not setting these two. One thing is that on the Spark UI, when I click the SQL tab, it shows an empty page with only the header title 'SQL'; there is no tab

Multilabel classification support

2015-09-11 Thread Yasemin Kaya
Hi, I want to use MLlib for multilabel classification, but I find http://spark.apache.org/docs/latest/mllib-classification-regression.html, and it is not what I mean. Is there a way to use multilabel classification? Thanks a lot. Best, yasemin -- hiç ender hiç