Re: 1 Executor per partition

2018-04-04 Thread utkarsh_deep
You are correct.






Re: Best way to Hive to Spark migration

2018-04-04 Thread Jörn Franke
You need to provide more context on what you currently do in Hive and what you
expect from the migration.

> On 5. Apr 2018, at 05:43, Pralabh Kumar  wrote:
> 
> Hi Spark group
> 
> What's the best way to Migrate Hive to Spark
> 
> 1) Use HiveContext of Spark
> 2) Use Hive on Spark 
> (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> 3) Migrate Hive to Calcite to Spark SQL
> 
> 
> Regards
> 


Best way to Hive to Spark migration

2018-04-04 Thread Pralabh Kumar
Hi Spark group

What's the best way to migrate Hive to Spark?

1) Use HiveContext of Spark
2) Use Hive on Spark (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
3) Migrate Hive to Calcite to Spark SQL
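
For reference, a minimal sketch of option 1 (in Spark 2.x a SparkSession with Hive support replaces HiveContext; the warehouse path, table and query below are placeholders):

// Sketch only: option 1, running existing HQL through Spark SQL with Hive support.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-to-spark-sql")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  // placeholder path
  .enableHiveSupport()
  .getOrCreate()

// Existing Hive queries can usually run as-is; the table name here is hypothetical.
spark.sql("SELECT col_a, count(*) FROM my_hive_table GROUP BY col_a").show()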


Regards


Spark uses more threads than specified in local[n]

2018-04-04 Thread Xiangyu Li
Hi,

I am running a job in local mode, configured with local[1] for the sake of
the example. The timeline view in the Spark UI (screenshot not included here)
shows that two threads are actually running, though they overlap only briefly.
To validate this, I also added a change to Spark's task runner that prints
`Thread.currentThread().getId()`, and it does print two distinct thread IDs.

Is this some behavior of Spark's thread pool to improve performance?
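
For reference, the same check can be reproduced without patching Spark by logging the thread ID from inside each task (a sketch; the master URL and partition count are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: print the executor thread ID from inside each task under local[1].
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("thread-check"))
sc.parallelize(1 to 8, numSlices = 4)
  .foreach(_ => println(s"task ran on thread ${Thread.currentThread().getId}"))
sc.stop()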

-- 
Sincerely
Xiangyu Li




how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-04 Thread Andy Davidson
I am having a heck of a time setting up my development environment. I used
pip to install pyspark. I also downloaded Spark from Apache.

My Eclipse PyDev interpreter is configured as a Python 3 virtualenv.

I have a simple unit test that loads a small DataFrame. df.show() generates
the following error:


2018-04-04 17:13:56 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException:
Error from python worker:
  Traceback (most recent call last):
    File "/Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site.py", line 67, in 
      import os
    File "/Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/os.py", line 409
      yield from walk(new_path, topdown, onerror, followlinks)
            ^
      SyntaxError: invalid syntax





My unit test class is derived from:



class PySparkTestCase(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        conf = SparkConf().setMaster("local[2]") \
            .setAppName(cls.__name__)  # \
            # .set("spark.authenticate.secret", "11")
        cls.sparkContext = SparkContext(conf=conf)
        sc_values[cls.__name__] = cls.sparkContext
        cls.sqlContext = SQLContext(cls.sparkContext)
        print("aedwip:", SparkContext)

    @classmethod
    def tearDownClass(cls):
        print("calling stop tearDownClass, the content of sc_values=", sc_values)
        sc_values.clear()
        cls.sparkContext.stop()



This looks similar to the PySparkTestCase class in
https://github.com/apache/spark/blob/master/python/pyspark/tests.py



Any suggestions would be greatly appreciated.



Andy



My downloaded version is spark-2.3.0-bin-hadoop2.7



My virtual env version is:

(spark-2.3.0) $ pip show pySpark
Name: pyspark
Version: 2.3.0
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: d...@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-packages
Requires: py4j
(spark-2.3.0) $

(spark-2.3.0) $ python --version
Python 3.6.1
(spark-2.3.0) $






Re: trouble with 'pip pyspark' pyspark.sql.functions. "unresolved import" for col() and lit()

2018-04-04 Thread Gourav Sengupta
Hi,

the way I manage things is, download spark, and set SPARK_HOME and the
import findspark and run findspark.init(). And everything else works just
fine.

I have never tried pip install pyspark though.


Regards,
Gourav Sengupta

On Wed, Apr 4, 2018 at 11:28 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> I am having trouble setting up my python3 virtualenv.
>
> I created a virtualenv ‘spark-2.3.0’ Installed pyspark using pip how ever
> I am not able to import pyspark.sql.functions. I get “unresolved import”
> when I try to import col() and lit()
>
> from pyspark.sql.functions import *
>
>
> I found if I download spark from apache and set SPARK_ROOT I can get my
> juypter notebook to work. This is a very error prone work around. I am
> having simiilar problem with my eclipse pyDev virtualenv
>
> Any suggestions would be greatly appreciated
>
> Andy
>
>
> # pip show in virtualenv
>
> (spark-2.3.0) $ pip show pyspark
>
> Name: pyspark
>
> Version: 2.3.0
>
> Summary: Apache Spark Python API
>
> Home-page: https://github.com/apache/spark/tree/master/python
>
> Author: Spark Developers
>
> Author-email: d...@spark.apache.org
>
> License: http://www.apache.org/licenses/LICENSE-2.0
>
> Location: /Users/foo/workSpace/pythonEnv/spark-2.3.0/lib/
> python3.6/site-packages
>
> Requires: py4j
>
> (spark-2.3.0) $
>
> (spark-2.3.0) $ ls ~/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-
> packages/pyspark/sql/functions.py
>
> ~/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-packages/pyspark/sql/
> functions.py
>
>
> # Jupyter Notebook
>
> Export SPARK_ROOT=~/workSpace/spark/spark-2.3.0-bin-hadoop2.7
>
>
>
> Eclipse pyDev virtual ENV
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


trouble with 'pip pyspark' pyspark.sql.functions. "unresolved import" for col() and lit()

2018-04-04 Thread Andy Davidson
I am having trouble setting up my python3 virtualenv.

I created a virtualenv 'spark-2.3.0' and installed pyspark using pip; however,
I am not able to import pyspark.sql.functions. I get "unresolved import" when
I try to import col() and lit():

from pyspark.sql.functions import *

I found that if I download Spark from Apache and set SPARK_ROOT I can get my
Jupyter notebook to work. This is a very error-prone workaround. I am having a
similar problem with my Eclipse PyDev virtualenv.

Any suggestions would be greatly appreciated.

Andy


# pip show in virtualenv

(spark-2.3.0) $ pip show pyspark
Name: pyspark
Version: 2.3.0
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: d...@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /Users/foo/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-packages
Requires: py4j
(spark-2.3.0) $

(spark-2.3.0) $ ls ~/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-packages/pyspark/sql/functions.py
~/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-packages/pyspark/sql/functions.py

# Jupyter Notebook
export SPARK_ROOT=~/workSpace/spark/spark-2.3.0-bin-hadoop2.7




Eclipse pyDev virtual ENV







Re: 1 Executor per partition

2018-04-04 Thread Gourav Sengupta
Each partition should be translated into one task which should run in one
executor. But one executor can process more than one task. I may be wrong,
and will be grateful if someone can correct me.

Regards,
Gourav

On Wed, Apr 4, 2018 at 8:13 PM, Thodoris Zois  wrote:

>
> Hello list!
>
> I am trying to familiarize with Apache Spark. I  would like to ask
> something about partitioning and executors.
>
> Can I have e.g: 500 partitions but launch only one executor that will run
> operations in only 1 partition of the 500? And then I would like my job to
> die.
>
> Is there any easy way? Or i have to modify code to achieve that?
>
> Thank you,
>  Thodoris
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


1 Executor per partition

2018-04-04 Thread Thodoris Zois

Hello list!

I am trying to familiarize myself with Apache Spark. I would like to ask
something about partitioning and executors.

Can I have, e.g., 500 partitions but launch only one executor that will run
operations on only 1 of the 500 partitions? And then I would like my job to die.

Is there an easy way, or do I have to modify code to achieve that?

Thank you,
 Thodoris




Issue with nested JSON parsing in to data frame

2018-04-04 Thread Ritesh Shah
Hello,

I am using Apache Spark 2.2.1 with Scala. I am trying to load the JSON below
from Kafka and extract "JOBTYPE" and "LOADID" from the nested JSON object. I
need help with the extraction logic.

Code

val workRequests = new StructType().add("after", new StructType()
  .add("JOBTYPE", StringType)
  .add("LOADID", StringType))

val dfn = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers",
    "zlp21299.vci.att.com:29092,zlp21301.vci.att.com:29093,zlp21318.vci.att.com:29094")
  .option("subscribe", "edgemw")
  .load()

val df2 = dfn.selectExpr("CAST(value AS STRING)").as[(String)]
  .select(from_json($"value", workRequests)).as("data")
  .select("data.*")

df2.writeStream
  .format("console")
  .option("truncate", "false")
  .option("numRows", 10)
  .start()
  .awaitTermination()


Desired Output

JOBTYPE | LOADID
------- | ------
Val 1   | Val 1
Val 2   | Val 2

JSON

{"table":"FORCEMW.XBFWR_T","op_type":"U","op_ts":"2018-04-03 
15:10:35.036322","current_ts":"2018-04-03T15:10:39.269000","pos":"0296150084924838","before":{},"after":{"WRID":93694148,"LOADID":"OH_OHIO_I","SOURCETYPE":"RWG","EXTKEY1":"S0010","WORKTYPE":"RT","WORKTYPEQUALIFIER":"IN","TIMESTAMP":"2018-04-03:12:41:49","EXTKEY2":"NWCMOH49
 S0010 00130
","WRSTAT":"ASSIGN","WRSTATQUAL":null,"JOBTYPE":"S0010","ESTDURATION":"28","ESTSETUP":"0","PRIORITY":"55","LOADSTARTDATETIME":"2018-04-01:11:00:00","LOADENDDATETIME":"2018-04-11:03:59:00","ASSIGNDATE":"2018-04-03:00:00:00","COMMITMENTDATETIME":"2018-05-01:03:59:00","LATITUDE":40.273643,"LONGITUDE":-81.607832,"ASSIGNCOORDHORIZ":null,"ASSIGNCOORDVERT":null,"ACCESSBEGINDATETIME":null,"ACCESSENDDATETIME":null,"DISPATCH2AREA":"N","LATESTSTARTDATETIME":null,"TECHPREFERRED":null,"OWNERID":"Force","TECHASSIGNED":"MM4341","UNSCHEDULED":"N","UNLOCATED":"N","ACCESSCODE":null,"COMPCANDATETIME":null,"APPOINTMENTDATETIME":null,"CLOSEOUTVALIDATIONRQD":null,"CUSTADVISEDNAMERQD":null,"CUSTADVISEDREACHNBRRQD":null,"DISPATCHDATE":null,"DISPATCHDATECLASS":null,"DMNAENDDATETIME1":null,"DMNAENDDATETIME2":null,"DMNAFLAG1":null,"DMNAFLAG2":null,"DMNASTARTDATETIME1":null,"DMNASTARTDATETIME2":null,"ESTORIGDURATION":"28","MANUALLOCKDATETIME":null,"MANUALLOCKEDFLAG":"N","MISSINGDESIGNDATE":null,"PCTCOMPLETE":0,"TASKDURATIONAREANAME":null,"TRAVELTIMEAREANAME":null,"SUMMARYRQD":null,"TRAVELDATETIMEFACTOR":0,"UPDATEDWHILEDISPATCHED":"N","VISITFLAG":"Y","EXTDESTINATION":null,"LOCQUALIFIER":"001","ROUTETOCODE":null,"AUXSTATUS":null,"CURRENTSOURCESTATUS":null,"LASTDISPATCHSOURCESTATUS":null,"LASTSOURCESTATUSDATETIME":null,"APPOINTMENTWINDOWEND":null,"VERSIONNBR":"9","SEARCHDATE":null,"AUXSTATQUAL":null,"EXCEEDPRIORITYTHRES":"N","CANCELORDER":null,"CIRCUITWORKLOCCLLI":"NWCMOH49","JOBTYPEREQUIRESAPPROVAL":"N","APPROVEDFORDISPATCH":null,"WIT":"W","MINIMUMASSIGNTIME":null,"RTNTYPE":"PM","RTNBEGINTIME":null,"RTNENDTIME":null,"ASSIGNABLEFLAG":"Y","DUEDATEFLAG":"N","GROUPKEY":null,"PRIORITIZATIONAREA":null,"ORDDUEDATE":null,"SAMELOCATIONINDICATOR":"Y","ESTIMATEDSTARTDATETIME":"2018-04-03:18:40:45","ESTIMATEDCOMPLETIONDATETIME":"2018-04-03:18:54:45","OBJECTTIMEZONE":"EASTERN","LINKKEY":null,"CUSTNAME":null,"SERVICEADDRSTRING":null,"RESERVATIONID":null,"RESERVATIONSTATQUAL":null,"RSVTECHASSIGNED":null,"RSVASSIGNDATE":null,"RSVMATCHFLAG":null,"RSVCNTFLAG":null,"RSVOVRBKFLAG":"N","RSVDETAILS":null,"REISSUEFLAG":"Y","OOSINDICATOR":null,"DELAYDISPATCHFLAG":null,"INTOWFLAG":null,"HELPERFLAG":null,"NUMTASKSCOMPLETED":null,"HONORFUTUREACCESSFLAG":"N","CALLOUTASSIGNFLAG":null,"TECHLOADTYPE":null,"OVERRIDELOADPARAMSIND":"N","LOCATIONPROXIMITYCODE":null,"OOSMEMBERCOUNT":null,"PMAFLAG":null,"OVERRIDEEVALUATEIND":null,"REPORTEDDATETIME":null,"IMMEDDISPIND":null,"RELATEDTROUBLEREPORT":null,"DISPCOORDFLAG":null,"CUSTEMAIL":null,"CUSTNOTIFY":null,"ESTIMATEDARRIVALDATETIME":null,"DISPCOORDKEY":null,"DISPCOORDSEQ":null,"SOURCEQUALIFIER":null,"APDFIELDLOCKS":0,"RESTRICTACCESSCANDIDATEFLAG":null,"ALAPINDICATOR":null,"ENTRYDATETIME":"2018-03-19:18:02:29","JEPRELEASEDATE":null,"JOBLISTDATA":null,"STEPSEQNBR":null,"CMDCNTRLFLAGS":null}}






Re: Scala program to spark-submit on k8 cluster

2018-04-04 Thread purna pradeep
Yes, a “REST application that submits a Spark job to a k8s cluster by running
spark-submit programmatically”, and I would also like to expose it as a
Kubernetes service so that clients can access it like any other REST API.
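
For reference, a hedged sketch of the programmatic-submission piece using Spark's launcher API; the master URL, container image, jar path and class name below are placeholders, and the service still needs SPARK_HOME pointing at a Spark 2.3 distribution:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Sketch: submit a Spark 2.3 job to Kubernetes from inside a JVM service.
val handle: SparkAppHandle = new SparkLauncher()
  .setMaster("k8s://https://kubernetes-api-host:443")            // placeholder
  .setDeployMode("cluster")
  .setMainClass("com.example.MySparkApp")                        // placeholder
  .setAppResource("local:///opt/spark/jars/my-spark-app.jar")    // placeholder
  .setConf("spark.kubernetes.container.image", "my-registry/spark:2.3.0")
  .startApplication()

// handle.getState() can then be polled and surfaced through the REST endpoint.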

On Wed, Apr 4, 2018 at 12:25 PM Yinan Li  wrote:

> Hi Kittu,
>
> What do you mean by "a Scala program"? Do you mean a program that submits
> a Spark job to a k8s cluster by running spark-submit programmatically, or
> some example Scala application that is to run on the cluster?
>
> On Wed, Apr 4, 2018 at 4:45 AM, Kittu M  wrote:
>
>> Hi,
>>
>> I’m looking for a Scala program to spark submit a Scala application
>> (spark 2.3 job) on k8 cluster .
>>
>> Any help  would be much appreciated. Thanks
>>
>>
>>
>


Re: Scala program to spark-submit on k8 cluster

2018-04-04 Thread Yinan Li
Hi Kittu,

What do you mean by "a Scala program"? Do you mean a program that submits a
Spark job to a k8s cluster by running spark-submit programmatically, or
some example Scala application that is to run on the cluster?

On Wed, Apr 4, 2018 at 4:45 AM, Kittu M  wrote:

> Hi,
>
> I’m looking for a Scala program to spark submit a Scala application (spark
> 2.3 job) on k8 cluster .
>
> Any help  would be much appreciated. Thanks
>
>
>


ClassCastException: java.sql.Date cannot be cast to java.lang.String in Scala

2018-04-04 Thread anbu
Could someone please help me fix the error below in Spark 2.1.0 with
Scala 2.11.8? Basically I'm migrating code from Spark 1.6.0 to Spark 2.1.0.

I'm getting the exception below in Spark 2.1.0:

Error: java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.String
  at org.apache.spark.sql.Row$class.getString(Row.scala)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala)

Existing code in Spark 1.6.0:

import java.text.SimpleDateFormat

val someRDD = RDD.groupByKey().mapPartitions { iterator =>
  val dataFormatter = new SimpleDateFormat("-MM-dd")
  myList.map(ip => Row(
    ip._1.getInt(0),
    ip._1.getLong(1),
    ip._1.getInt(2),
    new java.sql.Date(dataFormatter.parse(ip._1.getString(7)).getTime),
    new java.sql.Date(dataFormatter.parse(ip._1.getString(8)).getTime)
  )).iterator
}

The error is thrown at the two java.sql.Date lines above.

I have tried a different approach in the following way:

import java.sql.Date

Date.valueOf(ip._1.getString(7)).getTime,
Date.valueOf(ip._1.getString(8)).getTime,

I'm still getting the same error:

Caused by: java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.String
  at org.apache.spark.sql.Row$class.getString(Row.scala)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala)

Please help me with this error in Spark 2.1.0, and suggest any site or
document for Spark migration.
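
For reference, a standalone illustration of the difference (a sketch, not the original job): in Spark 2.x the column already arrives as a java.sql.Date, so it is read with getDate rather than getString plus SimpleDateFormat.

import java.sql.Date
import org.apache.spark.sql.Row

// Minimal sketch: a Row holding a DATE value.
val row = Row(Date.valueOf("2018-04-04"))

val d: Date = row.getDate(0)   // works: the value is already a java.sql.Date
// row.getString(0)            // throws: java.sql.Date cannot be cast to java.lang.String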






Scala program to spark-submit on k8 cluster

2018-04-04 Thread Kittu M
Hi,

I’m looking for a Scala program to spark-submit a Scala application (a Spark
2.3 job) on a k8s cluster.

Any help  would be much appreciated. Thanks


Re: NumberFormatException while reading and split the file

2018-04-04 Thread utkarsh_deep
Response to the 1st approach:

spark.read.text("/xyz/a/b/filename") returns a DataFrame, and calling .rdd on
it gives you an RDD[Row], so when you use map, your function receives a Row as
the parameter (ip in your code). Therefore you must use the Row methods to
access its members. The error message says it clearly: "value split is not a
member of org.apache.spark.sql.Row". There is no split method on Row, which is
why it throws the error.



Response to the 2nd approach:

There is something fishy there. The condition ip(0).isEmpty() should catch the
case where the first field is an empty string, so when it is not actually
empty, ip(0).toInt shouldn't fail. But you also need to make sure ip(0) is not
some other string that can't be converted to an Int.
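
A hedged sketch combining both points, reusing newRdd from the original post (pipe-delimited text and a hypothetical six-field layout):

import scala.util.Try
import org.apache.spark.sql.Row

// Use the Row accessor to get the line, then parse the first field defensively
// so blank or non-numeric values do not throw NumberFormatException.
val anotherRDD = newRdd.map { r =>
  val ip = r.getString(0).split("\\|", -1)          // -1 keeps trailing empty fields
  val first: Any = Try(ip(0).trim.toInt).getOrElse(null)
  Row(first, ip(1), ip(2), ip(3), ip(4), ip(5))
}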






Re: How to delete empty columns in df when writing to parquet?

2018-04-04 Thread Junfeng Chen
Our users ask for it


Regard,
Junfeng Chen

On Wed, Apr 4, 2018 at 5:45 PM, Gourav Sengupta 
wrote:

> Hi Junfeng,
>
> can I ask why it is important to remove the empty column?
>
> Regards,
> Gourav Sengupta
>
> On Tue, Apr 3, 2018 at 4:28 AM, Junfeng Chen  wrote:
>
>> I am trying to read data from Kafka and write it in Parquet format
>> via Spark Streaming.
>> The problem is, the data from Kafka have a variable structure. For
>> example, app one has columns A,B,C and app two has columns B,C,D, so the
>> DataFrame I read from Kafka has all columns A,B,C,D. When I write the
>> DataFrame to Parquet files partitioned by app name,
>> the Parquet files of app one also contain column D, which is empty and
>> holds no data. So how can I filter out the empty columns when writing the
>> DataFrame to Parquet?
>>
>> Thanks!
>>
>>
>> Regard,
>> Junfeng Chen
>>
>
>
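
For reference, a hedged sketch of one way to do this (hypothetical column and path names): for each app's slice of the DataFrame, count the non-null values per column and keep only the columns that actually hold data before writing.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: drop every column of this slice that contains no data at all.
def dropEmptyColumns(df: DataFrame): DataFrame = {
  val counts = df.select(df.columns.map(c => count(col(c)).as(c)): _*).first()
  val nonEmpty = df.columns.filter(c => counts.getAs[Long](c) > 0)
  df.select(nonEmpty.map(col): _*)
}

// Hypothetical usage: one write per app value instead of partitionBy("app").
// dropEmptyColumns(df.where(col("app") === "app_one"))
//   .write.mode("append").parquet("/data/out/app=app_one")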


Re: How to delete empty columns in df when writing to parquet?

2018-04-04 Thread Gourav Sengupta
Hi Junfeng,

can I ask why it is important to remove the empty column?

Regards,
Gourav Sengupta

On Tue, Apr 3, 2018 at 4:28 AM, Junfeng Chen  wrote:

> I am trying to read data from Kafka and write it in Parquet format via
> Spark Streaming.
> The problem is, the data from Kafka have a variable structure. For
> example, app one has columns A,B,C and app two has columns B,C,D, so the
> DataFrame I read from Kafka has all columns A,B,C,D. When I write the
> DataFrame to Parquet files partitioned by app name,
> the Parquet files of app one also contain column D, which is empty and
> holds no data. So how can I filter out the empty columns when writing the
> DataFrame to Parquet?
>
> Thanks!
>
>
> Regard,
> Junfeng Chen
>


Building Data Warehouse Application in Spark

2018-04-04 Thread Mahender Sarangam
Hi,
Does anyone have a good architecture document or design principles for building
a warehouse application using Spark?

Is it better to create a HiveContext and perform the transformations in HQL, or
to load the files directly into DataFrames and perform the data transformations
there?

We need to implement SCD Type 2 in Spark. Is there any better document/reference
for building Type 2 warehouse objects?
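
For reference, a hedged sketch of one SCD Type 2 pass on DataFrames. The column names business_key, eff_from, eff_to and is_current are hypothetical, and it assumes the incoming batch carries only new or changed business keys whose attribute columns match the dimension's.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: expire the open version of changed keys and append the new versions.
def scd2Pass(current: DataFrame, incoming: DataFrame, runDate: String): DataFrame = {
  val changedKeys = incoming.select("business_key")

  // close the currently open version of every key present in the incoming batch
  val expired = current
    .join(changedKeys, Seq("business_key"), "left_semi")
    .where(col("is_current"))
    .withColumn("eff_to", lit(runDate))
    .withColumn("is_current", lit(false))

  // keep everything else unchanged: untouched keys plus already-closed history rows
  val untouched = current.join(changedKeys, Seq("business_key"), "left_anti")
  val history = current
    .join(changedKeys, Seq("business_key"), "left_semi")
    .where(!col("is_current"))

  // open a new version for every incoming row
  val opened = incoming
    .withColumn("eff_from", lit(runDate))
    .withColumn("eff_to", lit(null).cast("string"))
    .withColumn("is_current", lit(true))

  // unionByName needs Spark 2.3+; it aligns the new versions with the dimension by name
  untouched.union(history).union(expired).unionByName(opened)
}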

Thanks in advance

/Mahender


Re: run huge number of queries in Spark

2018-04-04 Thread Georg Heiler
See https://gist.github.com/geoHeil/e0799860262ceebf830859716bbf in
particular:

You will probably want to use Spark's imperative (non-SQL) API:

.rdd
.reduceByKey { (count1, count2) => count1 + count2 }
.map { case ((word, path), n) => (word, (path, n)) }
.toDF

i.e., it builds an inverted index, which easily lets you then calculate TF/IDF.
But Spark also comes with
https://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf which
might help you achieve the desired result easily.
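
For reference, a hedged sketch of the single-pass alternative, reusing dataFrame and the ID/text columns from the original post and assuming whitespace tokenization is acceptable: explode the documents into (document, term) pairs once and aggregate, instead of issuing one driver-side query per term.

import org.apache.spark.sql.functions._

// Sketch: document frequency of every term in one distributed job.
val docFreq = dataFrame
  .select(col("ID"), explode(split(lower(col("text")), "\\s+")).as("term"))
  .distinct()                                   // count each term at most once per document
  .groupBy("term")
  .agg(countDistinct("ID").as("doc_freq"))

// docFreq can then be filtered down to the 50,000 terms of interest.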

Donni Khan  schrieb am Mi., 4. Apr. 2018 um
10:56 Uhr:

> Hi all,
>
> I want to run huge number of queries on Dataframe in Spark. I have a big
> data of text documents, I loded all documents into SparkDataFrame and
> create a temp table.
>
> dataFrame.registerTempTable("table1");
>
> I have more than 50,000 terms, I want to get the document frequency for
> each by using the "table1".
>
> I use the follwing:
>
> DataFrame df=sqlContext.sql("select count(ID) from table1 where text like
> '%"+term+"%'");
>
> but this scenario needs much time to finish because I have t run it from
> Spark Driver for each term.
>
>
> Does anyone has idea how I can run all queries in distributed way?
>
> Thank you && Best Regards,
>
> Donni
>
>
>
>


run huge number of queries in Spark

2018-04-04 Thread Donni Khan
Hi all,

I want to run a huge number of queries on a DataFrame in Spark. I have a big
set of text documents; I loaded all the documents into a Spark DataFrame and
created a temp table:

dataFrame.registerTempTable("table1");

I have more than 50,000 terms, and I want to get the document frequency for
each of them by using "table1".

I use the following:

DataFrame df = sqlContext.sql("select count(ID) from table1 where text like
'%" + term + "%'");

but this scenario needs much time to finish because I have to run it from the
Spark driver for each term.

Does anyone have an idea how I can run all the queries in a distributed way?

Thank you && Best Regards,

Donni


NumberFormatException while reading and split the file

2018-04-04 Thread anbu
1st Approach:

error: value split is not a member of org.apache.spark.sql.Row

val newRdd = spark.read.text("/xyz/a/b/filename").rdd

anotherRDD = newRdd
  .map(ip => ip.split("\\|"))
  .map(ip => Row(
    if (ip(0).isEmpty()) { null.asInstanceOf[Int] } else ip(0).toInt,
    ip(1), ip(2), ip(3), ip(4), ip(5)))

I'm getting the error on the line ip.split("\\|"): value split is not a
member of org.apache.spark.sql.Row.

Another approach:

error: java.lang.NumberFormatException: For input string:

val newRdd = spark.read.text("/xyz/a/b/filename").rdd

anotherRDD = newRdd
  .map(ip => ip.toString().split("\\|"))
  .map(ip => Row(
    if (ip(0).isEmpty()) { null.asInstanceOf[Int] } else ip(0).toInt,
    ip(1), ip(2), ip(3), ip(4), ip(5)))

anotherRDD.collect().foreach(println)

In this case I'm getting the error "java.lang.NumberFormatException: For
input string: ""



