Re: How to list all dataframes and RDDs available in current session?

2015-08-21 Thread Raghavendra Pandey
You can get the list of all the persisted RDDs using the spark context... On Aug 21, 2015 12:06 AM, Rishitesh Mishra rishi80.mis...@gmail.com wrote: I am not sure if you can view all RDDs in a session. Tables are maintained in a catalogue, hence it's easier. However you can see the DAG representation
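A minimal sketch of the suggestion, assuming a live SparkContext named sc (getPersistentRDDs only lists RDDs that were explicitly persisted or cached):

    // Map from RDD id to the persisted RDD itself
    val persisted = sc.getPersistentRDDs
    persisted.foreach { case (id, rdd) =>
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
    }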

Re: spark streaming 1.3 kafka error

2015-08-21 Thread Shushant Arora
It comes at the start of each task when there is new data inserted in Kafka (the data inserted is very small). The Kafka topic has 300 partitions - the data inserted is ~10 MB. Tasks fail and are retried (the retries succeed), and after a certain number of failed tasks it kills the job. On Sat, Aug 22, 2015 at 2:08 AM,

Re: How can I save the RDD result as Orcfile with spark1.3?

2015-08-21 Thread dong.yajun
Hi Ted, thanks for your reply. Is there any other way to do this with Spark 1.3, such as writing the ORC file manually in a foreachPartition method? On Sat, Aug 22, 2015 at 12:19 PM, Ted Yu yuzhih...@gmail.com wrote: ORC support was added in Spark 1.4 See SPARK-2883 On Fri, Aug 21, 2015 at 7:36

Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang
ML plans to provide Machine Learning pipelines so that users can do machine learning more efficiently. It's more general to chain StringIndexer with any kind of Estimator. I think we can make StringIndexer and the reverse process automatic in the future. If you want to know your original labels, you
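A hedged sketch of recovering the original labels from a fitted StringIndexer (the DataFrame and column names are illustrative, and it assumes a Spark version where StringIndexerModel exposes its labels array publicly):

    import org.apache.spark.ml.feature.StringIndexer

    // df is assumed to have a string column "category"
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val model = indexer.fit(df)

    // model.labels(i) is the original string that was mapped to index i
    val original: String = model.labels(0)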

DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Eugene Morozov
Hi, I'm using Spark 1.3.1 built against Hadoop 1.0.4 and Java 1.7, and I'm trying to save my data frame to Parquet. The issue I'm stuck on looks like the serialization is trying to do a pretty weird thing: write to an empty array. The last (through the stack trace) line of Spark code that leads to

Is long running Spark batch job in fine grained mode Deprecated?

2015-08-21 Thread Akash Mishra
Hi *, We are trying to run Spark on top of Mesos using fine-grained mode. While talking to a few people I came to know that running a Spark job using fine-grained mode on Mesos is not a good idea. I could not find anything regarding fine-grained mode getting deprecated, and also if coarse-grained mode

Spark streaming multi-tasking during I/O

2015-08-21 Thread Sateesh Kavuri
Hi, my scenario goes like this: I have an algorithm running in Spark streaming mode on a 4-core virtual machine. The majority of the time, the algorithm does disk I/O and database I/O. The question is: during the I/O, when the CPU is not considerably loaded, is it possible to run any other task/thread

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-21 Thread Michael Albert
This is something of a wild guess, but I find that when executors start disappearing for no obvious reason, this is usually because the YARN node-managers have decided that the containers are using too much memory and then terminate the executors. Unfortunately, to see evidence of this, one

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
I believe spark-shell -i scriptFile is there. We also use it, at least in Spark 1.3.1. dse spark will just wrap the spark-shell command; underneath it is just invoking spark-shell. I don't know too much about the original problem though. Yong Date: Fri, 21 Aug 2015 18:19:49 +0800 Subject: Re:

Re: SPARK sql: Need JSON back instead of row

2015-08-21 Thread Roberto Congiu
2015-08-21 3:17 GMT-07:00 smagadi sudhindramag...@fico.com: teenagers.toJSON gives the JSON but it does not preserve the parent ids, meaning if the input was {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"},"age":20} val x = sqlContext.sql("SELECT name, address.city, address.state, age FROM

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
Hi All, any inputs on the actual problem statement? Regards, Satish On Fri, Aug 21, 2015 at 5:57 PM, Jeff Zhang zjf...@gmail.com wrote: Yong, thanks for your reply. I tried spark-shell -i script-file; it works fine for me. Not sure of the difference with dse spark --master local --jars

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-08-21 Thread Ravi Mody
I've been able to almost halve my memory usage with no instability issues. I lowered my storage.memoryFraction and increased my shuffle.memoryFraction (essentially swapping them). I set spark.yarn.executor.memoryOverhead to 6GB. And I lowered executor-cores in case other jobs are using the

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread Abhishek R. Singh
You had: RDD.reduceByKey((x,y) => x+y) RDD.take(3) Maybe try: rdd2 = RDD.reduceByKey((x,y) => x+y) rdd2.take(3) -Abhishek- On Aug 20, 2015, at 3:05 AM, satish chandra j jsatishchan...@gmail.com wrote: HI All, I have data in RDD as mentioned below: RDD : Array[(Int),(Int)] =

Re: spark streaming 1.3 kafka error

2015-08-21 Thread Cody Koeninger
Sounds like that's happening consistently, not an occasional network problem? Look at the Kafka broker logs. Make sure you've configured the correct Kafka broker hosts / ports (note that the direct stream does not use the zookeeper host / port). Make sure that host / port is reachable from your driver

Want to install lz4 compression

2015-08-21 Thread Saif.A.Ellafi
Hi all, I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop's native libraries, so I am not able to use it. Can anyone suggest how to proceed? Hopefully I won't have to recompile Hadoop. I tried changing the --driver-library-path to point directly into the lz4 stand
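An aside not from the thread, hedged: if the goal is LZ4 for Spark's internal data (shuffle and RDD serialization) rather than Hadoop file I/O, Spark bundles a pure-Java LZ4 implementation (lz4-java) that needs no Hadoop native library:

    import org.apache.spark.SparkConf

    // Uses Spark's bundled lz4-java codec; no Hadoop native libraries involved.
    val conf = new SparkConf()
      .set("spark.io.compression.codec", "lz4")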

Re: Want to install lz4 compression

2015-08-21 Thread Ted Yu
Have you read this? http://stackoverflow.com/questions/22716346/how-to-use-lz4-compression-in-linux-3-11 On Aug 21, 2015, at 6:57 AM, saif.a.ell...@wellsfargo.com saif.a.ell...@wellsfargo.com wrote: Hi all, I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop’s

Tungsten and sun.misc.Unsafe

2015-08-21 Thread Marek Kolodziej
Hello, I attended the Tungsten-related presentations at Spark Summit (by Josh Rosen) and at Big Data Scala (by Matei Zaharia). Needless to say, this project holds great promise for major performance improvements. At Josh's talk, I heard about the use of sun.misc.Unsafe as a way of achieving some

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
Hi Abhishek, I have even tried that but rdd2 is empty. Regards, Satish On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: You had: RDD.reduceByKey((x,y) => x+y) RDD.take(3) Maybe try: rdd2 = RDD.reduceByKey((x,y) => x+y) rdd2.take(3)

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
What version of Spark are you using, or what comes with DSE 4.7? We just cannot reproduce it in Spark.
yzhang@localhost$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect
yzhang@localhost$ ~/spark/bin/spark-shell --master local -i

Spark ec2 launch problem

2015-08-21 Thread Garry Chen
Hi All, I am trying to launch a Spark EC2 cluster by running spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cluster but I keep getting the following message endlessly. Please help. Warning: SSH connection error.

Re: build spark 1.4.1 with JDK 1.6

2015-08-21 Thread Chen Song
Thanks Sean. So how is PySpark supported? I thought PySpark needs JDK 1.6. Chen On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen so...@cloudera.com wrote: Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote: I tried to build Spark 1.4.1 on cdh 5.4.0.

Re: build spark 1.4.1 with JDK 1.6

2015-08-21 Thread Marcelo Vanzin
That was only true until Spark 1.3. Spark 1.4 can be built with JDK7 and pyspark will still work. On Fri, Aug 21, 2015 at 8:29 AM, Chen Song chen.song...@gmail.com wrote: Thanks Sean. So how PySpark is supported. I thought PySpark needs jdk 1.6. Chen On Fri, Aug 21, 2015 at 11:16 AM, Sean

build spark 1.4.1 with JDK 1.6

2015-08-21 Thread Chen Song
I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support PySpark, I used JDK 1.6. I got the following error, [INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) @ spark-streaming_2.10 --- java.lang.UnsupportedClassVersionError:

Finding the number of executors.

2015-08-21 Thread Virgil Palanciuc
Is there any reliable way to find out the number of executors programmatically, regardless of how the job is run? A method that preferably works for spark-standalone, yarn, and mesos, regardless of whether the code runs from the shell or not? Things that I tried that don't work: -

RE: Spark ec2 launch problem

2015-08-21 Thread Garry Chen
No, the message never ends. I have to ctrl-c out of it. Garry From: shahid ashraf [mailto:sha...@trialx.com] Sent: Friday, August 21, 2015 11:13 AM To: Garry Chen g...@cornell.edu Cc: user@spark.apache.org Subject: Re: Spark ec2 launch problem Does the cluster work in the end? On Fri, Aug 21,

Re: build spark 1.4.1 with JDK 1.6

2015-08-21 Thread Sean Owen
Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM Chen Song chen.song...@gmail.com wrote: I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support PySpark, I used JDK 1.6. I got the following error, [INFO] --- scala-maven-plugin:3.2.0:testCompile

Re: SparkSQL concerning materials

2015-08-21 Thread Sameer Farooqui
Have you seen the Spark SQL paper?: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf On Thu, Aug 20, 2015 at 11:35 PM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, thanks for answers. I have read answers you provided, but I rather look for some materials on the

Re: Spark ec2 launch problem

2015-08-21 Thread shahid ashraf
Does the cluster work in the end? On Fri, Aug 21, 2015 at 8:25 PM, Garry Chen g...@cornell.edu wrote: Hi All, I am trying to launch a Spark EC2 cluster by running spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc --subnet-id=subnet-011 --spark-version=1.4.1

Re: Data locality with HDFS not being seen

2015-08-21 Thread Sameer Farooqui
Hi Sunil, Have you seen this fix in Spark 1.5 that may address the locality issue?: https://issues.apache.org/jira/browse/SPARK-4352 On Thu, Aug 20, 2015 at 4:09 AM, Sunil sdhe...@gmail.com wrote: Hello. I am seeing some unexpected issues with achieving HDFS data locality. I expect the

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-21 Thread Burak Yavuz
Hi Naveen, As I mentioned before, the code is private and therefore not accessible. Just copy and use the snippet that I sent. Copying it here again: https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270
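For readers who cannot use the private method, a hedged re-implementation using only public accessors (a sketch under the assumption that the matrices use the default non-transposed layout):

    import breeze.linalg.{CSCMatrix => BSM, DenseMatrix => BDM, Matrix => BM}
    import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix, SparseMatrix}

    def toBreeze(m: Matrix): BM[Double] = m match {
      case dm: DenseMatrix =>
        // toArray always yields column-major values, matching Breeze's layout
        new BDM[Double](dm.numRows, dm.numCols, dm.toArray)
      case sm: SparseMatrix =>
        // assumes the non-transposed CSC layout
        new BSM[Double](sm.values, sm.numRows, sm.numCols, sm.colPtrs, sm.rowIndices)
    }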

Re: Spark streaming multi-tasking during I/O

2015-08-21 Thread Rishitesh Mishra
Hi Sateesh, It is interesting to know how you determined that the DStream runs on a single core. Did you mean receivers? Coming back to your question, could you not start the disk I/O in a separate thread, so that the scheduler can go ahead and assign other tasks? On 21 Aug 2015 16:06, Sateesh

RE: SparkR csv without headers

2015-08-21 Thread Felix Cheung
You could also rename them with names(). Unfortunately the API doc doesn't show an example of that: https://spark.apache.org/docs/latest/api/R/index.html On Thu, Aug 20, 2015 at 7:43 PM -0700, Sun, Rui rui@intel.com wrote: Hi, You can create a DataFrame using load.df() with a specified schema.

Re: Spark streaming multi-tasking during I/O

2015-08-21 Thread Akhil Das
You can look at spark.streaming.concurrentJobs; by default it runs a single job. If you set it to 2 then it can run 2 jobs in parallel. It's an experimental flag, but go ahead and give it a try. On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, My scenario goes like this:
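A minimal sketch of wiring the flag in (the app name and batch interval are illustrative; the flag itself is experimental and undocumented):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-jobs-demo")          // hypothetical name
      .set("spark.streaming.concurrentJobs", "2")  // allow two jobs to run at once
    val ssc = new StreamingContext(conf, Seconds(10))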

Re: Set custom worker id?

2015-08-21 Thread Akhil Das
You can try adding a human-readable entry in the /etc/hosts file of the worker machine and then set SPARK_LOCAL_IP to this hostname in that machine's spark-env.sh file. On Aug 21, 2015 11:57 AM, saif.a.ell...@wellsfargo.com wrote: Hi, Is it possible in standalone to set

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com wrote: Hi, I'm using spark
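A minimal sketch of the suggested workaround (the flag name is quoted from the reply; the conf wiring is illustrative):

    import org.apache.spark.SparkConf

    // Disables the extra serialization debug pass that triggers SPARK-7180
    val conf = new SparkConf()
      .set("spark.serializer.extraDebugInfo", "false")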

Re: Spark ec2 launch problem

2015-08-21 Thread Akhil Das
It may happen that the version of the spark-ec2 script you are using is buggy, or sometimes AWS has problems provisioning machines. On Aug 21, 2015 7:56 AM, Garry Chen g...@cornell.edu wrote: Hi All, I am trying to launch a Spark EC2 cluster by running spark-ec2 --key-pair=key

Re: spark streaming 1.3 kafka error

2015-08-21 Thread Akhil Das
That looks like you are choking your Kafka machine. Do a top on the Kafka machines and look at the workload; it may be that you are spending too much time on disk I/O etc. On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote: Sounds like that's happening consistently, not an

How frequently should we expect full GC

2015-08-21 Thread java8964
In the test job I am running in Spark 1.3.1 in our stage cluster, I can see the following information on the application stage page:
Metric   | Min  | 25th percentile | Median  | 75th percentile | Max
Duration | 0 ms | 1.1 min         | 1.5 min | 1.7 min         | 3.4 min
GC Time  | 11 s | 16 s            | 21 s    | 25 s            | 54 s
From the GC output log, I can see it is

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Nathan Skone
Raghavendra, Thanks for the quick reply! I don’t think I included enough information in my question. I am hoping to get fields that are not directly part of the aggregation. Imagine a dataframe representing website views with a userID, datetime, and a webpage address. How could I find the

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Akhil Das
Did you try sorting it by datetime and doing a groupBy on the userID? On Aug 21, 2015 12:47 PM, Nathan Skone nat...@skone.org wrote: Raghavendra, Thanks for the quick reply! I don’t think I included enough information in my question. I am hoping to get fields that are not directly part of the

Re: Finding the number of executors.

2015-08-21 Thread Virgil Palanciuc
Hi Akhil, I'm using Spark 1.4.1. The number of executors is not in the command line, nor in getExecutorMemoryStatus (I already mentioned that I tried that; it works in spark-shell but not when executed via spark-submit). I tried looking at defaultParallelism too; it's 112 (7 executors * 16 cores)

Re: Finding the number of executors.

2015-08-21 Thread Akhil Das
Which version of Spark are you using? There was a discussion that happened over here http://apache-spark-user-list.1001560.n3.nabble.com/Determine-number-of-running-executors-td19453.html

Re: Worker Machine running out of disk for Long running Streaming process

2015-08-21 Thread Tathagata Das
Could you periodically (say every 10 mins) run System.gc() on the driver? The cleanup of shuffles is tied to garbage collection. On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma sharmagaura...@gmail.com wrote: Hi All, I have a 24x7 running Streaming Process, which runs on 2 hour windowed
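A minimal sketch of the suggestion, assuming it runs in the driver JVM and using the 10-minute period mentioned above:

    import java.util.concurrent.{Executors, TimeUnit}

    // Periodic GC on the driver lets Spark's ContextCleaner notice unreachable
    // shuffle references and clean up the corresponding files on the workers.
    val gcScheduler = Executors.newSingleThreadScheduledExecutor()
    gcScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc()
    }, 10, 10, TimeUnit.MINUTES)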

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
@Cheng, Hao: Physical plans show that it got stuck on scanning S3! (the table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and hour='00';
TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
TungstenExchange

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Raghavendra Pandey
Did you try with Hadoop version 2.7.1? It is known that s3a, available in 2.7, works really well with Parquet. They fixed a lot of issues related to metadata reading there... On Aug 21, 2015 11:24 PM, Jerrick Hoang jerrickho...@gmail.com wrote: @Cheng, Hao : Physical plans show that it

Re: Tungsten and sun.misc.Unsafe

2015-08-21 Thread Marek Kolodziej
Thanks Reynold, that helps a lot. I'm glad you're involved with that Google Doc community effort. I think it's because of that doc that the JEP's wording and scope changed for the better since it originally got introduced. Marek On Fri, Aug 21, 2015 at 11:18 AM, Reynold Xin r...@databricks.com

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Raghavendra Pandey
Impact, you can group the data by key and then sort it by timestamp and take the max to select the oldest value. On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote: I am also looking for a way to achieve the reducebykey functionality on data frames. In my case I need to select one particular row

RE: Remoting warning when submitting to cluster

2015-08-21 Thread javidelgadillo
I believe this was caused by some network configuration on my machines. After installing VirtualBox, some new network interfaces were installed on the machines and the Akka software was binding to one of the VirtualBox interfaces and not the interface that belonged to my Ethernet card. Once I

Set custom worker id?

2015-08-21 Thread Saif.A.Ellafi
Hi, is it possible in standalone mode to set up worker ID names, to avoid the worker-19248891237482379-ip..-port ?? Thanks, Saif

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Impact
I am also looking for a way to achieve the reduceByKey functionality on data frames. In my case I need to select one particular row (the oldest, based on a timestamp column value) by key.

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Dan LaBar
Nathan, I achieve this using rowNumber. Here is a Python DataFrame example:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import desc, rowNumber
    yourOutputDF = (
        yourInputDF
        .withColumn("first", rowNumber()
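A hedged Scala equivalent for the website-views example described earlier in the thread (the DataFrame and column names are assumptions; note that in Spark 1.4/1.5 window functions such as rowNumber require a HiveContext):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc, rowNumber}

    // One row per userID: the most recent page view wins.
    val w = Window.partitionBy("userID").orderBy(desc("datetime"))
    val latestViews = viewsDF
      .withColumn("rn", rowNumber().over(w))
      .where(col("rn") === 1)
      .drop("rn")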

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
Is there a workaround without updating Hadoop? I would really appreciate it if someone could explain what Spark is trying to do here and what an easy way to turn this off would be. Thanks all! On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Did you try with hadoop

Re: Finding the number of executors.

2015-08-21 Thread Du Li
Following is a method that retrieves the list of executors registered to a spark context. It worked perfectly with spark-submit in standalone mode for my project.
    /**
     * A simplified method that just returns the current active/registered executors
     * excluding the driver.
     * @param sc
     *
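A sketch of how such a method can look (a reconstruction under assumptions, not necessarily Du Li's exact code: getExecutorMemoryStatus returns host:port entries including the driver, so the driver's host is filtered out):

    import org.apache.spark.SparkContext

    def currentActiveExecutors(sc: SparkContext): Seq[String] = {
      val allExecutors = sc.getExecutorMemoryStatus.keys   // "host:port", driver included
      val driverHost = sc.getConf.get("spark.driver.host")
      allExecutors.filter(addr => addr.split(":")(0) != driverHost).toList
    }

As Virgil notes above, getExecutorMemoryStatus behaved differently for him under spark-submit than in spark-shell, so results may vary by deployment mode.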

Having Clause with variation and stddev

2015-08-21 Thread Ravisankar Mani
Hi, an exception is thrown when using a HAVING clause with variation or stddev. It works perfectly when using other aggregate functions (like sum, count, min, max..) SELECT SUM(1) AS `sum_number_of_records_ok` FROM `some_db`.`some_table` `some_table` GROUP BY 1 HAVING (STDDEV(1) > 0) SELECT SUM(1) AS

Re: spark kafka partitioning

2015-08-21 Thread Gaurav Agarwal
When I send a message from a Kafka topic having three partitions, Spark will listen for the message when I say KafkaUtils.createStream or createDirectStream and have local[4]. Now I want to see if Spark will create partitions when it receives a message from Kafka using a DStream: how, where, and in which method

Re: SparkSQL concerning materials

2015-08-21 Thread Dawid Wysakowicz
Hi, thanks for the answers. I have read the answers you provided, but I am rather looking for some materials on the internals, e.g. how the optimizer works, how the query is translated into RDD operations, etc. The API I am quite familiar with. A good starting point for me was: Spark DataFrames: Simple and Fast

ClassCastException when saving a DataFrame to parquet file (saveAsParquetFile, Spark 1.3.1) using Scala

2015-08-21 Thread Emma Boya Peng
Hi, I was trying to programmatically specify a schema, apply it to an RDD of Rows, and save the resulting DataFrame as a Parquet file. Here's what I did: 1. Created an RDD of Rows from an RDD[Array[String]]: val gameId = Long.valueOf(line(0)) val accountType = Long.valueOf(line(1)) val

Re: what determine the task size?

2015-08-21 Thread Robineast
The OP wants to understand what determines the size of the task code that is shipped to each executor so it can run the task. I don't know the answer but would be interested to know too. Sent from my iPhone On 21 Aug 2015, at 08:26, oubrik [via Apache Spark User List]

Re: PySpark concurrent jobs using single SparkContext

2015-08-21 Thread Hemant Bhanawat
It seems like you want simultaneous processing of multiple jobs but at the same time serialization of a few tasks within those jobs. I don't know how to achieve that in Spark. But why would you bother about the interleaved processing when the data being aggregated in different jobs is per

Re: SPARK sql: Need JSON back instead of row

2015-08-21 Thread Todd
Please try DataFrame.toJSON; it will give you an RDD of JSON strings. At 2015-08-21 15:59:43, smagadi sudhindramag...@fico.com wrote: val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") I need teenagers to be a JSON object rather than a simple row. How can we get

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
Yes, DSE 4.7 Regards, Satish Chandra On Fri, Aug 21, 2015 at 3:06 PM, Robin East robin.e...@xense.co.uk wrote: Not sure, never used dse - it’s part of DataStax Enterprise right? On 21 Aug 2015, at 10:07, satish chandra j jsatishchan...@gmail.com wrote: HI Robin, Yes, below mentioned

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
Hi Robin, yes, it is DSE but the issue is related to Spark only. Regards, Satish Chandra On Fri, Aug 21, 2015 at 3:06 PM, Robin East robin.e...@xense.co.uk wrote: Not sure, never used dse - it’s part of DataStax Enterprise right? On 21 Aug 2015, at 10:07, satish chandra j jsatishchan...@gmail.com

Re: SPARK sql: Need JSON back instead of row

2015-08-21 Thread smagadi
teenagers.toJSON gives the JSON but it does not preserve the parent ids, meaning if the input was {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"},"age":20} val x = sqlContext.sql("SELECT name, address.city, address.state, age FROM people where age > 19 and age <= 30").toJSON x.collect().foreach(println)
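A hedged sketch of one way to keep the nesting (an illustration, not from the thread: select the whole address struct instead of its leaves, and toJSON re-emits the parent object):

    // Selecting the struct column itself, not address.city / address.state,
    // preserves the nesting in the JSON output.
    val nested = sqlContext.sql(
      "SELECT name, address, age FROM people WHERE age > 19 AND age <= 30")
    nested.toJSON.collect().foreach(println)
    // e.g. {"name":"Yin","address":{"city":"Columbus","state":"Ohio"},"age":20}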

Worker Machine running out of disk for Long running Streaming process

2015-08-21 Thread gaurav sharma
Hi All, I have a 24x7 running Streaming Process, which runs on 2 hour windowed data. The issue I am facing is my worker machines are running OUT OF DISK space. I checked that the SHUFFLE FILES are not getting cleaned up.

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread Jeff Zhang
Hi Satish, I don't see where spark supports -i, so I suspect it is provided by DSE. In that case, it might be a bug in DSE. On Fri, Aug 21, 2015 at 6:02 PM, satish chandra j jsatishchan...@gmail.com wrote: HI Robin, Yes, it is DSE but issue is related to Spark only Regards, Satish Chandra

spark streaming 1.3 kafka error

2015-08-21 Thread Shushant Arora
Hi, getting the below error in Spark streaming 1.3 while consuming from Kafka using the direct Kafka stream. A few tasks are failing in each run. What is the reason for / solution to this error? 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in stage 130.0 (TID 16332)

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
Hi Robin, yes, the below-mentioned piece of code works fine in the Spark shell, but when the same is placed in a script file and executed with -i <file name> it creates an empty RDD: scala> val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40))) pairs: org.apache.spark.rdd.RDD[(Int, Int)] =

Re: Creating Spark DataFrame from large pandas DataFrame

2015-08-21 Thread ayan guha
The easiest option I found is to put the jars in the SPARK CLASSPATH. On 21 Aug 2015 06:20, Burak Yavuz brk...@gmail.com wrote: If you would like to try using spark-csv, please use `pyspark --packages com.databricks:spark-csv_2.11:1.2.0` You're missing a dependency. Best, Burak On Thu, Aug 20, 2015

ClassCastException when saving a DataFrame to parquet file (saveAsParquetFile, Spark 1.3.1) using Scala

2015-08-21 Thread Emma Boya Peng
Hi, I was trying to programmatically specify a schema, apply it to an RDD of Rows, and save the resulting DataFrame as a Parquet file, but I got java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long on the last step. Here's what I did: 1. Created an RDD of Rows from
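A hedged sketch of the usual shape of this flow (names and output path are illustrative; the ClassCastException above typically means a field declared LongType in the schema still holds a String in the Row, so values must be converted before building each Row):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("gameId", LongType, nullable = false),
      StructField("accountType", LongType, nullable = false)))

    // linesRDD is assumed to be an RDD[Array[String]]; convert each String
    // field here so the Row contents match the schema types.
    val rowRDD = linesRDD.map { line =>
      Row(Long.valueOf(line(0)), Long.valueOf(line(1)))
    }

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.saveAsParquetFile("/tmp/games.parquet")   // hypothetical output path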

SPARK SQL support for XML

2015-08-21 Thread smagadi
Does Spark SQL support XML the same way as it supports JSON?

Re: Spark-Cassandra-connector

2015-08-21 Thread Ted Yu
Have you considered asking this question on https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user ? Cheers On Thu, Aug 20, 2015 at 10:57 PM, Samya samya.ma...@amadeus.com wrote: Hi All, I need to write an RDD to Cassandra using the sparkCassandraConnector from