Locality aware tree reduction

2016-05-04 Thread aymkhalil
Hello, Is there a way to instruct treeReduce() to reduce RDD partitions on the same node locally? In my case, I'm using treeReduce() to reduce map results in parallel. My reduce function is just arithmetically adding map values (i.e. no notion of aggregation by key). As far as I understand, a
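A minimal sketch of the pattern described above, assuming sc is an existing SparkContext. treeReduce(f, depth) performs `depth` levels of partial reduction on the executors before the final combine reaches the driver, but it exposes no parameter for forcing the first level to stay node-local.

```scala
// Sketch: arithmetically summing per-record values with a deeper reduction tree.
val values = sc.parallelize(1 to 1000000, 200).map(_.toDouble)
val total = values.treeReduce(_ + _, depth = 3)
println(total)
```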

Re: DeepSpark: where to start

2016-05-04 Thread Derek Chan
The blog post is an April Fool's joke. Read the last line in the post: https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html On Thursday, May 05, 2016 10:42 AM, Joice Joy wrote: I am trying to find info on deepspark. I read the article on databricks

ArrayIndexOutOfBoundsException in model selection via cross-validation sample with spark 1.6.1

2016-05-04 Thread Terry Hoo
All, I met the ArrayIndexOutOfBoundsException when running the model selection via cross-validation sample with Spark 1.6.1. Did anyone else meet this before? How can I resolve it? Call stack:

Re: DeepSpark: where to start

2016-05-04 Thread Ted Yu
Did you notice the date of the blog :-) ? On Wed, May 4, 2016 at 7:42 PM, Joice Joy wrote: > I am trying to find info on deepspark. I read the article on databricks > blog which doesnt mention a git repo but does say its open source. > Help me find the git repo for this.

DeepSpark: where to start

2016-05-04 Thread Joice Joy
I am trying to find info on deepspark. I read the article on the databricks blog which doesn't mention a git repo but does say it's open source. Help me find the git repo for this. I found two and am not sure which one is the databricks deepspark: https://github.com/deepspark/deepspark

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread Divya Gehlot
Hi, My javac version: C:\Users\Divya>javac -version javac 1.7.0_79 C:\Users\Divya>java -version java version "1.7.0_79" Java(TM) SE Runtime Environment (build 1.7.0_79-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode) Do I need to use a higher version? Thanks, Divya On 4 May

Re: Do I need to install Cassandra node on Spark Master node to work with Cassandra?

2016-05-04 Thread Yogesh Mahajan
You can have a Spark master where Cassandra is not running locally; I have tried this before. The Spark cluster and the Cassandra cluster could be on two different hosts, but to colocate, you can have both the executor and the Cassandra node on the same host. Thanks, http://www.snappydata.io/blog

Do I need to install Cassandra node on Spark Master node to work with Cassandra?

2016-05-04 Thread Vinayak Agrawal
Hi All, I am working with a Cassandra cluster and moving towards installing Spark. However, I came across this Stackoverflow question which has confused me. http://stackoverflow.com/questions/33897586/apache-spark-driver-instead-of-just-the-executors-tries-to-connect-to-cassand Question: Do I

Re: yarn-cluster

2016-05-04 Thread nsalian
Hi, this is a good place to start for Spark on YARN: https://spark.apache.org/docs/1.5.0/running-on-yarn.html The page is specific to the version you are on; you can toggle between versions. - Neelesh S. Salian Cloudera -- View this message in context:

Re: Spark standalone workers, executors and JVMs

2016-05-04 Thread Mich Talebzadeh
Hi, More cores without getting the memory-per-core ratio correct can result in more queuing and hence more contention, as was evident from the earlier published results. I had a bit of a discussion with one of the Spark experts, who stated/claimed one should have one executor per server and then get

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the DataFrame API to write, but it works fine when doing an insert overwrite into the Hive table. # Works fine hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name)) # Doesn't work,

RE: Spark standalone workers, executors and JVMs

2016-05-04 Thread Mohammed Guller
Spark allows you to configure the resources for the worker process. If I remember correctly, you can use SPARK_DAEMON_MEMORY to control the memory allocated to the worker process. #1 below is more appropriate if you will be running just one application at a time. A 32GB heap size is still too high.

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on 64-bit JVM with less than 32G heap, you might want to enable -XX:+UseCompressedOops[1]. And if your dataframe is somehow generating more than 2^31-1 number of arrays, you might have to rethink your options. [1] https://spark.apache.org/docs/latest/tuning.html On Wed, May 4,
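For reference, a hedged sketch of passing that JVM flag to the executors through standard Spark properties (the application name below is made up):

```scala
import org.apache.spark.SparkConf

// -XX:+UseCompressedOops only helps on heaps below roughly 32G; the same
// flag can also go into spark.driver.extraJavaOptions for the driver JVM.
val conf = new SparkConf()
  .setAppName("parquet-read")
  .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
```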

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak wrote: > Hi, > > I am reading the parquet file around 50+ G which has 4013 partitions

SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi, I am reading a parquet file of around 50+ GB which has 4013 partitions with 240 columns. Below is my configuration. Driver: 20G memory with 4 cores. Executors: 45 executors with 15G memory and 4 cores each. I tried to read the data using both Dataframe read and using hive context to read the data

DAG Pipelines?

2016-05-04 Thread Cesar Flores
I read the ml-guide page (http://spark.apache.org/docs/latest/ml-guide.html#details). It mentions that it is possible to construct DAG Pipelines. Unfortunately there is no example to explain under which use case this may be useful. Can someone give me an example or use case where this

Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Thanks Ted. This looks like the issue since I am running it in EMR and the Hive version is 1.0.0. Thanks, Bijay On Wed, May 4, 2016 at 10:29 AM, Ted Yu wrote: > Looks like you were hitting HIVE-11940 > > On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak

Stackoverflowerror in scala.collection

2016-05-04 Thread BenD
I am getting a java.lang.StackOverflowError somewhere in my program. I am not able to pinpoint which part causes it because the stack trace seems to be incomplete (see end of message). The error doesn't happen all the time, and I think it is based on the number of files that I load. I am running

Re: Writing output of key-value Pair RDD

2016-05-04 Thread Nicholas Chammas
You're looking for this discussion: http://stackoverflow.com/q/23995040/877069 Also, a simpler alternative with DataFrames: https://github.com/apache/spark/pull/8375#issuecomment-202458325 On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick wrote: > Hi, > > > Is there any
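For readers of the archive, a minimal Scala sketch of the technique discussed in that Stack Overflow thread: one output file per key via Hadoop's MultipleTextOutputFormat. The output path and sample data are hypothetical.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Routes each record to a file named after its key, keeping only the value
// in the file body.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

object WriteByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-by-key"))
    val pairs = sc.parallelize(Seq(("a", "1"), ("a", "2"), ("b", "3")))
    pairs.saveAsHadoopFile("s3a://my-bucket/out",   // hypothetical destination
      classOf[String], classOf[String], classOf[KeyBasedOutput])
    sc.stop()
  }
}
```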

Writing output of key-value Pair RDD

2016-05-04 Thread Afshartous, Nick
Hi, Is there any way to write out to S3 the values of a key-value Pair RDD? I'd like each value of a pair to be written to its own file, where the file name corresponds to the key name. Thanks, -- Nick

Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Mich Talebzadeh
Hang on. Are you talking about the target database in MSSQL being created and dropped? Is your Spark JDBC credential in MSSQL a DBO or sa? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Andrés Ivaldi
OK, so I did that: created a database and then inserted data, but Spark drops the database and tries to create it again. I'm using Dataframe.write(SaveMode.Overwrite), and the documentation says: "when performing a Overwrite, the data will be deleted before writing out the new data." Why is it dropping the table?
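A minimal sketch of the write path being discussed, with hypothetical connection details and assuming df is an existing DataFrame. In Spark 1.6 the JDBC writer handles SaveMode.Overwrite by dropping the existing table and re-creating it from the DataFrame's schema rather than truncating it, which matches the behaviour described above.

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Hypothetical SQL Server connection details.
val url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb"
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// Overwrite drops and re-creates the target table with Spark-generated
// column types, so any manually created table definition is lost.
df.write.mode(SaveMode.Overwrite).jdbc(url, "dbo.my_table", props)
```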

Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-05-04 Thread HLee
I had the same problem. One forum post elsewhere suggested that too much network communication might be using up available ports. I reduced the partition size via repartition(int) and it solved the problem. -- View this message in context:

spark job stage failures

2016-05-04 Thread Prajwal Tuladhar
Hi, I was wondering how Spark handles stage/task failures for a job. We are running a Spark job to batch write to ElasticSearch and we are seeing one or two stage failures due to the ES cluster getting overloaded (expected, as we are testing with a single-node ES cluster). But I was assuming that

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Mich Talebzadeh
This works with Spark 1.6.1, using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77), Kafka version 0.9.0.1, using scala-library-2.11.7.jar Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Shixiong(Ryan) Zhu
It's because the Scala version of Spark and the Scala version of Kafka don't match. Please check them. On Wed, May 4, 2016 at 6:17 AM, أنس الليثي wrote: > NoSuchMethodError usually appears because of a difference in the library > versions. > > Check the version of the
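A hedged build.sbt sketch of keeping those versions aligned for a Spark 1.6.1 / Scala 2.10 deployment; the %% operator appends the Scala binary suffix so the Spark and Kafka-integration artifacts cannot silently mix 2.10 and 2.11.

```scala
// build.sbt sketch: one Scala binary version for every Spark artifact.
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)
```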

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Lohith Samaga M
Hi, Can you look at Apache Drill as a SQL engine on Hive? Lohith Sent from my Sony Xperia™ smartphone Tapan Upadhyay wrote: Thank you everyone for the guidance. Jorn, our motivation is to move the bulk of ad-hoc queries to Hadoop so that we have enough bandwidth on our DB for important batch jobs/queries.

Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Ted Yu
Looks like you were hitting HIVE-11940 On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak wrote: > Hello, > > I am writing Dataframe of around 60+ GB into partitioned Hive Table using > hiveContext in parquet format. The Spark insert overwrite jobs completes in > a reasonable

Re: IS spark have CapacityScheduler?

2016-05-04 Thread Ted Yu
Cycling old bits: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scheduling-with-Capacity-scheduler-td10038.html On Wed, May 4, 2016 at 7:44 AM, 开心延年 wrote: > Scheduling Within an Application > > I found FAIRSchedule,but is there som exampe implements like yarn >

Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Hello, I am writing a Dataframe of around 60+ GB into a partitioned Hive Table using hiveContext in parquet format. The Spark insert overwrite job completes in a reasonable amount of time, around 20 minutes. But the job is taking a huge amount of time, more than 2 hours, to copy data from .hivestaging

unsubscribe

2016-05-04 Thread Vadim Vararu
unsubscribe - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: groupBy and store in parquet

2016-05-04 Thread Xinh Huynh
Hi Michal, For (1), would it be possible to partitionBy two columns to reduce the size? Something like partitionBy("event_type", "date"). For (2), is there a way to separate the different event types upstream, like on different Kafka topics, and then process them separately? Xinh On Wed, May
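A minimal sketch of suggestion (1), assuming events is a DataFrame that carries event_type and date columns and an HDFS target path:

```scala
// Each (event_type, date) combination lands in its own directory,
// e.g. .../event_type=click/date=2016-05-04/.
events.write
  .mode("append")
  .partitionBy("event_type", "date")
  .parquet("hdfs:///data/events")
```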

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Tapan Upadhyay
Thank you everyone for the guidance. Jorn, our motivation is to move the bulk of ad-hoc queries to Hadoop so that we have enough bandwidth on our DB for important batch jobs/queries. For implementing a lambda architecture, is it possible to get real-time updates from Teradata for any insert/update/delete? DB logs?

groupBy and store in parquet

2016-05-04 Thread Michal Vince
Hi guys, I'm trying to store a Kafka stream of ~5k events/s as efficiently as possible in Parquet format on HDFS. I can't make any changes to Kafka (it belongs to a 3rd party). Events in Kafka are in JSON format, but the problem is there are many different event types (from different subsystems

IS spark have CapacityScheduler?

2016-05-04 Thread 开心延年
Scheduling Within an Application: I found FAIRSchedule, but is there some example implementation like YARN's CapacityScheduler? (flattened pool config excerpt: FAIR 1 2, FIFO 2 3)
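For context, a hedged sketch of what Spark does offer inside a single application: FAIR scheduler pools defined in an allocation file, with jobs assigned to a pool per thread. The pool name and file path below are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable the in-application FAIR scheduler and point it at a pool
// definition file (fairscheduler.xml) holding the weights and minShares.
val conf = new SparkConf()
  .setAppName("fair-pools")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread are placed in the "adhoc" pool.
sc.setLocalProperty("spark.scheduler.pool", "adhoc")
```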

Re: restrict my spark app to run on specific machines

2016-05-04 Thread Ted Yu
Please refer to: https://spark.apache.org/docs/latest/running-on-yarn.html You can setup spark.yarn.am.nodeLabelExpression and spark.yarn.executor.nodeLabelExpression corresponding to the 2 machines. On Wed, May 4, 2016 at 3:03 AM, Shams ul Haque wrote: > Hi, > > I have a
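A short sketch of those two properties, assuming the two target machines have been given a YARN node label named "sparkapp1":

```scala
import org.apache.spark.SparkConf

// Pin both the application master and the executors to labelled nodes;
// the label itself must be configured in YARN beforehand.
val conf = new SparkConf()
  .setAppName("restricted-app")
  .set("spark.yarn.am.nodeLabelExpression", "sparkapp1")
  .set("spark.yarn.executor.nodeLabelExpression", "sparkapp1")
```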

Re: Spark standalone workers, executors and JVMs

2016-05-04 Thread Simone Franzini
Hi Mohammed, Thanks for your reply. I agree with you; however, a single application can use multiple executors as well, so I am still not clear which option is best. Let me give an example to be a little more concrete. Let's say I am only running a single application. Let's assume again that I

Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Cody Koeninger
Kafka 0.8.2 should be fine. If it works on your laptop but not on CDH, as Sean said you'll probably get better help on CDH forums. On Wed, May 4, 2016 at 4:19 AM, Michel Hubert wrote: > We're running Kafka 0.8.2.2 > Is that the problem, why? > > -Oorspronkelijk

Spark MLLib benchmarks

2016-05-04 Thread kmurph
Hi, I'm benchmarking Spark(1.6) and MLLib TF-IDF (with hdfs) on a 20GB dataset, and not seeing much scale-up when I increase cores/executors/RAM according to Spark tuning documentation. I suspect I'm missing a trick in my configuration. I'm running on shared memory (96 cores, 256GB RAM) and

Re: Spark Select Statement

2016-05-04 Thread Mich Talebzadeh
Which database is that table in, a Hive database? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 4 May

Re: Spark Select Statement

2016-05-04 Thread Ted Yu
Please take a look at sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java : } else if (key.startsWith("use:")) { SessionState.get().setCurrentDatabase(entry.getValue()); Regarding "no such table winbox_prod_action_logs_1": the above doesn't match

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-04 Thread sunday2000
Check your javac version, and update it. ------------------ From: "Divya Gehlot"; Date: 2016-05-04 11:25; To: "sunday2000" <2314476...@qq.com>; Cc: "user"; "user";

Reading from cassandra store in rdd

2016-05-04 Thread Yasemin Kaya
Hi, I asked this question on the DataStax group but I want to also ask the spark-user group; someone else may have faced this problem. I have data in Cassandra and want to get the data into a Spark RDD. I got an error, searched for it, but nothing changed. Is there anyone who can help me fix it? I can connect to Cassandra and cqlsh
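A minimal read sketch with the DataStax spark-cassandra-connector, under assumed keyspace, table, and contact-point values:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical keyspace "ks", table "users", Cassandra node at 10.0.0.5.
val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "10.0.0.5")
val sc = new SparkContext(conf)

val rows = sc.cassandraTable("ks", "users") // RDD of CassandraRow
println(rows.count())
```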

Re: Error from reading S3 in Scala

2016-05-04 Thread Steve Loughran
On 4 May 2016, at 13:52, Zhang, Jingyu wrote: Thanks everyone, One reason to use "s3a//" is because I use "s3a//" in my development env (Eclipse) on a desktop. I will debug and test on my desktop and then put the jar file on the EMR Cluster. I

Re: Spark and Kafka direct approach problem

2016-05-04 Thread أنس الليثي
NoSuchMethodError usually appears because of a difference in the library versions. Check the version of the libraries you downloaded, the version of spark, the version of Kafka. On 4 May 2016 at 16:18, Luca Ferrari wrote: > Hi, > > I’m new on Apache Spark and I’m trying

Spark and Kafka direct approach problem

2016-05-04 Thread Luca Ferrari
Hi, I'm new to Apache Spark and I'm trying to run the Spark Streaming + Kafka Integration Direct Approach example (JavaDirectKafkaWordCount.java). I've downloaded all the libraries, but when I try to run it I get this error: Exception in thread "main" java.lang.NoSuchMethodError:

Re: Error from reading S3 in Scala

2016-05-04 Thread Zhang, Jingyu
Thanks everyone, One reason to use "s3a//" is because I use "s3a//" in my development env (Eclipse) on a desktop. I will debug and test on my desktop and then put the jar file on the EMR Cluster. I do not think "s3//" will work on a desktop. With help from AWS support, this bug is caused by the version

spark w/ scala 2.11 and PackratParsers

2016-05-04 Thread matd
Hi folks, Our project is a mess of Scala 2.10 and 2.11, so I tried to switch everything to 2.11. I had some exasperating errors like this: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/DDLParser at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:208) at

Re: Error from reading S3 in Scala

2016-05-04 Thread James Hammerton
On 3 May 2016 at 17:22, Gourav Sengupta wrote: > Hi, > > The best thing to do is start the EMR clusters with proper permissions in > the roles that way you do not need to worry about the keys at all. > > Another thing, why are we using s3a// instead of s3:// ? >

restrict my spark app to run on specific machines

2016-05-04 Thread Shams ul Haque
Hi, I have a cluster of 4 machines for Spark. I want my Spark app to run on 2 machines only, and the remaining 2 machines to serve other Spark apps. So my question is, can I restrict my app to run on those 2 machines only, by passing some IP at the time of setting SparkConf or by any other setting? Thanks,

Re: Bit(N) on create Table with MSSQLServer

2016-05-04 Thread Andrés Ivaldi
Yes, I can do that; it's what we are doing now, but I think the best approach would be to delegate the create-table action to Spark. On Tue, May 3, 2016 at 8:17 PM, Mich Talebzadeh wrote: > Can you create the MSSQL (target) table first with the correct column > setting

Spark Select Statement

2016-05-04 Thread Sree Eedupuganti
Hello Spark users, can we run a SQL SELECT statement in Spark using Java? If it is possible, any suggestions please. I tried like this. How do I pass the database name? Here my database name is nimbus and the table name is winbox_opens. Source Code: public class Select { public static class
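A hedged sketch of the database-selection part, shown in Scala (the Java HiveContext API is analogous) and assuming a Hive metastore that contains the nimbus database:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: an existing SparkContext

// Either switch the current database first...
hiveContext.sql("USE nimbus")
val opens = hiveContext.sql("SELECT * FROM winbox_opens")

// ...or qualify the table name with the database directly.
val opens2 = hiveContext.sql("SELECT * FROM nimbus.winbox_opens")
opens2.show()
```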

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Alonso Isidoro Roman
Andy, I think there are some ideas about implementing a pool of Spark contexts but, for now, it is only an idea. https://github.com/spark-jobserver/spark-jobserver/issues/365 It is possible to share a Spark context between apps; I did not have to use this feature, sorry about that. Regards, Alonso

RE: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Michel Hubert
We're running Kafka 0.8.2.2. Is that the problem? Why? -----Original Message----- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, 4 May 2016 10:41 To: Michel Hubert CC: user@spark.apache.org Subject: Re: run-example streaming.KafkaWordCount fails on CDH

substitute mapPartitions by distinct

2016-05-04 Thread Batselem
Hi, I am trying to remove duplicates from a set of RDD tuples in an iterative algorithm. I have discovered that it is possible to substitute RDD mapPartitions for RDD distinct: first I partition the RDD, then de-duplicate each partition locally using the mapPartitions transformation. I expect it will be much faster
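A small sketch of that substitution, assuming sc is an existing SparkContext and each partition's tuples fit in memory as a Set; whether it actually beats distinct() depends on the data volume per partition.

```scala
import org.apache.spark.HashPartitioner

val tuples = sc.parallelize(Seq((1, "a"), (1, "a"), (2, "b")), 4)

val deduped = tuples
  .map(t => (t, null))                                  // key by the whole tuple
  .partitionBy(new HashPartitioner(tuples.partitions.length))
  .mapPartitions(iter => iter.map(_._1).toSet.iterator, // local de-duplication
                 preservesPartitioning = true)
```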

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Tobias Eriksson
Hi Andy, We have a very simple approach, I think; we do it like this: 1. Submit our Spark application to the Spark Master (version 1.6.1). 2. Our application creates a Spark Context that we use throughout. 3. We use a Spray REST server. 4. Every request that comes in we simply serve by

Re: run-example streaming.KafkaWordCount fails on CDH 5.7.0

2016-05-04 Thread Sean Owen
Please try the CDH forums; this is the Spark list: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark Before you even do that, I can tell you to double check you're running Kafka 0.9. On Wed, May 4, 2016 at 9:29 AM, Michel Hubert wrote: > > > Hi, > > >

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Alonso Isidoro Roman
I agree with Deepak, and I would try to save the data in both Parquet and Avro formats if you can; measure the performance and choose the best. It will probably be Parquet, but you have to verify for yourself. Alonso Isidoro Roman. My favourite quotes (of today): "If debugging is the process of removing

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Jörn Franke
Look at lambda architecture. What is the motivation of your migration? > On 04 May 2016, at 03:29, Tapan Upadhyay wrote: > > Hi, > > We are planning to move our adhoc queries from teradata to spark. We have > huge volume of queries during the day. What is best way to go

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Mich Talebzadeh
Hi, How are you going to sync your data following the migration? Spark SQL is a tool for querying data; it is not a database per se like Hive or anything else. I am doing the same, migrating Sybase IQ to Hive. Sqoop can do the initial ELT (read ELT, not ETL). In other words, use Sqoop to get