Re: Apache Spark-Subtract two datasets

2017-10-12 Thread Imran Rajjad
If the datasets hold objects of different classes, then you will have to convert both of them to RDDs and rename the columns before you call rdd1.subtract(rdd2). On Thu, Oct 12, 2017 at 10:16 PM, Shashikant Kulkarni < shashikant.kulka...@gmail.com> wrote: > Hello, > > I have 2 datasets,
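A minimal PySpark sketch of the same idea (the thread is about the Java API; the data and column names below are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("subtract-example").getOrCreate()

    # Two hypothetical datasets with differing schemas; project both onto a
    # common set of columns first, then subtract.
    df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
    df2 = spark.createDataFrame([(2, "b", "x")], ["id", "value", "extra"])

    common = ["id", "value"]
    only_in_df1 = df1.select(common).subtract(df2.select(common))
    only_in_df1.show()

In the Java API the equivalent DataFrame-level operation is Dataset.except(), which avoids dropping down to RDDs when the column types line up.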

Re: Tasks not getting scheduled on specified executor host

2017-10-12 Thread Tarun Kumar
Any response on this one? On Fri, Oct 13, 2017 at 1:01 AM, Tarun Kumar wrote: > I am specifying executor host for my tasks via > org.apache.hadoop.fs.LocatedFileStatus#locations > (which is org.apache.hadoop.fs.BlockLocation[])'s hosts field, still > tasks are getting

Re: Spark - Partitions

2017-10-12 Thread Chetan Khatri
Use repartition. On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote: > Hi, > > I am reading a hive query and writing the data back into hive after doing > some transformations. > > I have changed the setting spark.sql.shuffle.partitions to 2000 and since then > the job completes
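A hedged sketch of that suggestion: reduce the number of output files by coalescing (or repartitioning on the Hive partition columns) just before the write. The DataFrame and table names are made up:

    # `df` is the transformed DataFrame from the Hive query. Coalescing to a
    # smaller partition count right before the write means fewer, larger files
    # per Hive partition, while spark.sql.shuffle.partitions can stay at 2000
    # for the intermediate shuffles.
    df.coalesce(50).write.mode("overwrite").insertInto("target_table")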

Spark - Partitions

2017-10-12 Thread KhajaAsmath Mohammed
Hi, I am reading a Hive query and writing the data back into Hive after doing some transformations. I have changed the setting spark.sql.shuffle.partitions to 2000 and since then the job completes fast, but the main problem is I am getting 2000 files for each partition, each file about 10 MB in size. Is there a

interpretation of Spark logs about remote fetch latency

2017-10-12 Thread ricky l
Hello Spark users, I have an inquiry while analyzing a sample Spark task. The task has remote fetches (shuffle) from a few blocks. However, the remote fetch time does not really make sense to me. Can someone please help to interpret this? The logs came from the Spark REST API. Task ID 33 needs

Re: How to flatten a row in PySpark

2017-10-12 Thread ayan guha
Quick pyspark code:

>>> s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
>>> base = sc.parallelize([s.split("|")])
>>> base.take(10)
[['ABZ', 'ABZ', 'AF', '2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y', '1,2,3,4,5', '730']]
>>> def pv(t):
...     x = t[3].split(",")
...     y =
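The snippet above is truncated by the archive; a hedged guess at how the rest of the flatMap could look (illustrative only, not the original author's code):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    s = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
    base = sc.parallelize([s.split("|")])

    # Cross the comma-separated 4th and 5th fields, keeping the rest intact.
    def pv(t):
        xs = t[3].split(",")
        ys = t[4].split(",")
        return [(t[0], t[1], t[2], x, y, t[5]) for x in xs for y in ys]

    flattened = base.flatMap(pv)
    print(flattened.take(5))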

Re: Java Rdd of String to dataframe

2017-10-12 Thread Stéphane Verlet
You can specify the schema programmatically: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Wed, Oct 11, 2017 at 3:35 PM, sk skk wrote: > Can we create a dataframe from a Java pair rdd of String . I don’t have a >
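For reference, a small PySpark version of what that section of the guide describes, with made-up column names (the original question is about the Java API, where the same pattern uses a JavaRDD of Row plus createDataFrame(rowRDD, schema)):

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical RDD of raw strings; split each line, then attach an
    # explicit schema when building the DataFrame.
    lines = spark.sparkContext.parallelize(["alice,engineering", "bob,sales"])
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("dept", StringType(), True),
    ])
    rows = lines.map(lambda l: l.split(",")).map(lambda p: Row(p[0], p[1]))
    df = spark.createDataFrame(rows, schema)
    df.show()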

Checksum error during checkpointing

2017-10-12 Thread Lalwani, Jayesh
We have a Structured Streaming application running for more than 40 hours. We are storing checkpoints in EFS. The application is failing with a checksum error on the checkpoint (stack trace below). Is this because the checkpoint is corrupt? Is there a way to fix this? Is this a bug in Spark?

Tasks not getting scheduled on specified executor host

2017-10-12 Thread Tarun Kumar
I am specifying the executor host for my tasks via org.apache.hadoop.fs.LocatedFileStatus#locations (which is org.apache.hadoop.fs.BlockLocation[])'s hosts field, but tasks are still getting scheduled on a different executor host.
- Speculative execution is off.
- Also confirmed that locality wait time out
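Not a fix, but for anyone reproducing this, these are the locality-related settings the thread refers to (the values below are purely illustrative, not recommendations):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.locality.wait", "10s")       # how long the scheduler waits per locality level
            .set("spark.locality.wait.node", "10s")  # wait for NODE_LOCAL before falling back
            .set("spark.speculation", "false"))      # speculation is already off in the report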

Re: UnresolvedAddressException in Kubernetes Cluster

2017-10-12 Thread Matt Cheah
Hi there, This closely resembles https://github.com/apache-spark-on-k8s/spark/issues/523, and we’re having some discussion there to find some possible root causes. However, what release of the fork are you working off of? Are you using the HEAD of branch-2.2-kubernetes, or something else?

Re: How to flatten a row in PySpark

2017-10-12 Thread Nicholas Hakobian
Using explode on the 4th column, followed by an explode on the 5th column would produce what you want (you might need to use split on the columns first if they are not already an array). Nicholas Szandor Hakobian, Ph.D. Staff Data Scientist Rally Health nicholas.hakob...@rallyhealth.com On Thu,
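A short PySpark sketch of that suggestion, with hypothetical column names for the pipe-delimited record from the original question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("ABZ", "ABZ", "AF", "2,3,7,8,B,C", "1,2,3,4,5", "730")],
        ["orig", "dest", "carrier", "classes", "days", "code"])

    # split() turns the comma-separated strings into arrays; explode() then
    # produces one output row per element (applied once per column).
    flattened = (df
        .withColumn("cls", explode(split("classes", ",")))
        .withColumn("day", explode(split("days", ","))))
    flattened.show()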

Partition and Sort by together

2017-10-12 Thread Siva Gudavalli
Hello, I have my data stored in Parquet file format. My data is already partitioned by date and key. Now I want the data in each file to be sorted by a new Code column.

date1
  -> key1
    -> paqfile1
    -> paqfile2
  -> key2
    -> paqfile1
    -> paqfile2
date2
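One hedged way to get both the directory layout and the per-file ordering, assuming the columns are literally named date, key, and Code (a sketch under those assumptions, not a verified recipe):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("/data/input")

    # Repartition on the layout columns so each task holds one (date, key)
    # slice, sort rows inside each partition by Code, then write back with
    # the same directory structure.
    (df.repartition("date", "key")
       .sortWithinPartitions("Code")
       .write
       .partitionBy("date", "key")
       .mode("overwrite")
       .parquet("/data/output"))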

Apache Spark-Subtract two datasets

2017-10-12 Thread Shashikant Kulkarni
Hello, I have 2 datasets, one Dataset and the other Dataset. I want the list of records which are in the first Dataset but not in the second. How can I do this in Apache Spark using the Java connector? I am using Apache Spark 2.2.0. Thank you

How to flatten a row in PySpark

2017-10-12 Thread Debabrata Ghosh
Hi, Greetings! I am having data in the format of the following row:

ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730

I want to convert it into several rows in the format below:

ABZ|ABZ|AF|2|1|730
ABZ|ABZ|AF|3+1|730
. . .
ABZ|ABZ|AF|3|1|730
ABZ|ABZ|AF|3|2|730

Reading Array Type of Struct Column from a Dataset

2017-10-12 Thread ashmeet kandhari
Hi, I'm trying to read a dataset column which is of type array of StructType using Java. Can someone guide me on how I can read/iterate, update, or delete one or more elements of the array of a single row efficiently for that column? Sample schema of that column is: root |-- Person:
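A PySpark sketch of the read/iterate part (the question is about Java, and the field names below are invented to roughly match the 'Person' column in the schema):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col
    from pyspark.sql.types import (StructType, StructField, ArrayType,
                                   StringType, IntegerType)

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("Person", ArrayType(StructType([
            StructField("name", StringType()),
            StructField("age", IntegerType()),
        ])))
    ])
    df = spark.createDataFrame([([("alice", 30), ("bob", 25)],)], schema)

    # explode() gives one row per array element; struct fields are then
    # addressed with dot notation. Updates are typically done by rebuilding
    # the array (e.g. explode, modify the rows, then collect_list back).
    exploded = df.select(explode(col("Person")).alias("p"))
    exploded.select("p.name", "p.age").show()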

Re: Why does Spark need to set log levels

2017-10-12 Thread Steve Loughran
> On 9 Oct 2017, at 16:49, Daan Debie wrote: > > Hi all! > > I would love to use Spark with a somewhat more modern logging framework than > Log4j 1.2. I have Logback in mind, mostly because it integrates well with > central logging solutions such as the ELK stack. I've