Unsubscribe

2018-09-06 Thread Anu B Nair
Hi,

I have tried every possible way to unsubscribe from this group. Can anyone
help?

--
Anu


Re: getting error: value toDF is not a member of Seq[columns]

2018-09-06 Thread Mich Talebzadeh
I am trying to understand why Spark cannot convert simple comma-separated
columns into a DataFrame.

I did a test

I took one line of the printed output and stored it as a one-line CSV file, as below:

var allInOne = key+","+ticker+","+timeissued+","+price
println(allInOne)

cat crap.csv
6e84b11d-cb03-44c0-aab6-37e06e06c996,MRW,2018-09-06T09:35:53,275.45

Then, after storing it in HDFS, I read the file back as below:

import org.apache.spark.sql.functions._
val location="hdfs://rhes75:9000/tmp/crap.csv"
val df1 = spark.read.option("header", false).csv(location)
case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE:
Double)
val df2 = df1.map(p => columns(p(0).toString,p(1).toString,
p(2).toString,p(3).toString.toDouble))
df2.printSchema

This is the result I get

df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more
fields]
defined class columns
df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string
... 2 more fields]
root
 |-- KEY: string (nullable = true)
 |-- TICKER: string (nullable = true)
 |-- TIMEISSUED: string (nullable = true)
 |-- PRICE: double (nullable = false)

So in my case the only difference is that the comma-separated line is
stored in a String rather than in a CSV file.

How can I achieve this simple transformation?
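
For reference, the same record can also be turned into a DataFrame without the
round trip through HDFS. A minimal sketch, assuming allInOne holds the
comma-separated line as above and spark is the active SparkSession:

// split the in-memory string and build a one-row DataFrame from a case class
import spark.implicits._

case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)

val p = allInOne.split(",")
val df = Seq(columns(p(0), p(1), p(2), p(3).toDouble)).toDF
df.printSchema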

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Sep 2018 at 03:38, Manu Zhang  wrote:

> Have you tried adding Encoder for columns as suggested by Jungtaek Lim ?
>
> On Thu, Sep 6, 2018 at 6:24 AM Mich Talebzadeh 
> wrote:
>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> I can rebuild the comma separated list as follows:
>>
>>
>>case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
>> PRICE: Float)
>> val sqlContext= new org.apache.spark.sql.SQLContext(sparkContext)
>> import sqlContext.implicits._
>>
>>
>>  for(line <- pricesRDD.collect.toArray)
>>  {
>>var key = line._2.split(',').view(0).toString
>>var ticker =  line._2.split(',').view(1).toString
>>var timeissued = line._2.split(',').view(2).toString
>>var price = line._2.split(',').view(3).toFloat
>>var allInOne = key+","+ticker+","+timeissued+","+price
>>println(allInOne)
>>
>> and the print shows the columns separated by ","
>>
>>
>> 34e07d9f-829a-446a-93ab-8b93aa8eda41,SAP,2018-09-05T23:22:34,56.89
>>
>> So I just need to convert that row into a DataFrame.
>>
>> I try this conversion to DF to write to MongoDB document with 
>> MongoSpark.save(df,
>> writeConfig)
>>
>> var df = sparkContext.parallelize(Seq(columns(key, ticker, timeissued,
>> price))).toDF
>>
>> [error]
>> /data6/hduser/scala/md_streaming_mongoDB/src/main/scala/myPackage/md_streaming_mongoDB.scala:235:
>> value toDF is not a member of org.apache.spark.rdd.RDD[columns]
>> [error] var df = sparkContext.parallelize(Seq(columns(key,
>> ticker, timeissued, price))).toDF
>> [
>>
>>
>> frustrating!
>>
>>  has anyone come across this?
>>
>> thanks
>>
>> On Wed, 5 Sep 2018 at 13:30, Mich Talebzadeh 
>> wrote:
>>
>>> yep already tried it and it did not work.
>>>
>>> thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 5 Sep 2018 at 10:10, Deepak Sharma 
>>> wrote:
>>>
 Try this:

 *imp

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-06 Thread Jungtaek Lim
This code works with Spark 2.3.0 via spark-shell.

scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
PRICE: Float)
defined class columns

scala> import spark.implicits._
import spark.implicits._

scala> var df = Seq(columns("key", "ticker", "timeissued", 1.23f)).toDF
18/09/06 18:02:23 WARN ObjectStore: Failed to get database global_temp,
returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
more fields]

scala> df
res0: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
more fields]

We may need to know the actual types of key, ticker, timeissued and price
in your variables.
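
It may also be worth rechecking the implicits on the original RDD route. A
hedged sketch of a variant that compiles (literal values stand in for the real
key, ticker, timeissued and price; whether this matches the failing build's
setup is an assumption):

case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Float)

val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
// spark.implicits._ supplies rddToDatasetHolder, which is what adds toDF to an
// RDD of a case class
import spark.implicits._

val df = spark.sparkContext
  .parallelize(Seq(columns("key", "ticker", "timeissued", 1.23f)))
  .toDF
df.printSchema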

Jungtaek Lim (HeartSaVioR)

On Thu, Sep 6, 2018 at 5:57 PM, Mich Talebzadeh wrote:

> I am trying to understand why spark cannot convert a simple comma
> separated columns as DF.
>
> I did a test
>
> I took one line of print and stored it as a one liner csv file as below
>
> var allInOne = key+","+ticker+","+timeissued+","+price
> println(allInOne)
>
> cat crap.csv
> 6e84b11d-cb03-44c0-aab6-37e06e06c996,MRW,2018-09-06T09:35:53,275.45
>
> Then after storing it in HDFS, I read that file as below
>
> import org.apache.spark.sql.functions._
> val location="hdfs://rhes75:9000/tmp/crap.csv"
> val df1 = spark.read.option("header", false).csv(location)
> case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE:
> Double)
> val df2 = df1.map(p => columns(p(0).toString,p(1).toString,
> p(2).toString,p(3).toString.toDouble))
> df2.printSchema
>
> This is the result I get
>
> df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more
> fields]
> defined class columns
> df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string
> ... 2 more fields]
> root
>  |-- KEY: string (nullable = true)
>  |-- TICKER: string (nullable = true)
>  |-- TIMEISSUED: string (nullable = true)
>  |-- PRICE: double (nullable = false)
>
> So in my case the only difference is that that comma separated line is
> stored in a String as opposed to csv.
>
> How can I achieve this simple transformation?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 6 Sep 2018 at 03:38, Manu Zhang  wrote:
>
>> Have you tried adding Encoder for columns as suggested by Jungtaek Lim ?
>>
>> On Thu, Sep 6, 2018 at 6:24 AM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> I can rebuild the comma separated list as follows:
>>>
>>>
>>>case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
>>> PRICE: Float)
>>> val sqlContext= new org.apache.spark.sql.SQLContext(sparkContext)
>>> import sqlContext.implicits._
>>>
>>>
>>>  for(line <- pricesRDD.collect.toArray)
>>>  {
>>>var key = line._2.split(',').view(0).toString
>>>var ticker =  line._2.split(',').view(1).toString
>>>var timeissued = line._2.split(',').view(2).toString
>>>var price = line._2.split(',').view(3).toFloat
>>>var allInOne = key+","+ticker+","+timeissued+","+price
>>>println(allInOne)
>>>
>>> and the print shows the columns separated by ","
>>>
>>>
>>> 34e07d9f-829a-446a-93ab-8b93aa8eda41,SAP,2018-09-05T23:22:34,56.89
>>>
>>> So I just need to convert that row into a DataFrame.
>>>
>>> I try this conversion to DF to write to MongoDB document with 
>>> MongoSpark.save(df,
>>> writeConfig)
>>>
>>> var df = sparkContext.parallelize(Seq(columns(key, ticker, timeissued,
>>> price))).toDF
>>>
>>> [error]
>>> /data6/hduser/scala/md_streaming_mongoDB/src/main/scala/myPackage/md_streaming_mongoDB.scala:235:
>>> value toDF is not a member of org.apache.spark.rdd.RDD[columns]
>>> [error] var df = sparkContext.parallelize(Seq(columns(key,
>>> ticker, timeissued, price))).toDF
>>> [
>>>
>>>
>>> frus

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-06 Thread Mich Talebzadeh
Thanks. If you define the columns class as below:


scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)
defined class columns
scala> var df = Seq(columns("key", "ticker", "timeissued", 1.23f)).toDF
df: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
more fields]
scala> df.printSchema
root
 |-- KEY: string (nullable = true)
 |-- TICKER: string (nullable = true)
 |-- TIMEISSUED: string (nullable = true)
 |-- PRICE: double (nullable = false)

it looks better.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Sep 2018 at 10:10, Jungtaek Lim  wrote:

> This code works with Spark 2.3.0 via spark-shell.
>
> scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
> PRICE: Float)
> defined class columns
>
> scala> import spark.implicits._
> import spark.implicits._
>
> scala> var df = Seq(columns("key", "ticker", "timeissued", 1.23f)).toDF
> 18/09/06 18:02:23 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
> df: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
> more fields]
>
> scala> df
> res0: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
> more fields]
>
> Maybe need to know about actual type of key, ticker, timeissued, price
> from your variables.
>
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Sep 6, 2018 at 5:57 PM, Mich Talebzadeh wrote:
>
>> I am trying to understand why spark cannot convert a simple comma
>> separated columns as DF.
>>
>> I did a test
>>
>> I took one line of print and stored it as a one liner csv file as below
>>
>> var allInOne = key+","+ticker+","+timeissued+","+price
>> println(allInOne)
>>
>> cat crap.csv
>> 6e84b11d-cb03-44c0-aab6-37e06e06c996,MRW,2018-09-06T09:35:53,275.45
>>
>> Then after storing it in HDFS, I read that file as below
>>
>> import org.apache.spark.sql.functions._
>> val location="hdfs://rhes75:9000/tmp/crap.csv"
>> val df1 = spark.read.option("header", false).csv(location)
>> case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
>> PRICE: Double)
>> val df2 = df1.map(p => columns(p(0).toString,p(1).toString,
>> p(2).toString,p(3).toString.toDouble))
>> df2.printSchema
>>
>> This is the result I get
>>
>> df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2
>> more fields]
>> defined class columns
>> df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string
>> ... 2 more fields]
>> root
>>  |-- KEY: string (nullable = true)
>>  |-- TICKER: string (nullable = true)
>>  |-- TIMEISSUED: string (nullable = true)
>>  |-- PRICE: double (nullable = false)
>>
>> So in my case the only difference is that that comma separated line is
>> stored in a String as opposed to csv.
>>
>> How can I achieve this simple transformation?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 6 Sep 2018 at 03:38, Manu Zhang  wrote:
>>
>>> Have you tried adding Encoder for columns as suggested by Jungtaek Lim ?
>>>
>>> On Thu, Sep 6, 2018 at 6:24 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 I can rebuild the comma separated list as follows:


case class columns(KEY: String, TICKER: String, TIMEISSUE

Re: CBO not working for Parquet Files

2018-09-06 Thread emlyn
rajat mishra wrote
> When I try to compute the statistics for a query where the partition column
> is in the where clause, the statistics returned contain only sizeInBytes
> and not the row count.

We are also having the same issue. We have our data in partitioned Parquet
files and were hoping to try out CBO, but haven't been able to get it
working: any query with a where clause on the partition column(s) (which is
the majority of realistic queries) seems to lose/ignore the rowCount stats.
We've generated both table-level stats (ANALYZE TABLE db.table COMPUTE
STATISTICS;) and partition-level stats (ANALYZE TABLE db.table PARTITION
(col1, col2) COMPUTE STATISTICS;), and have verified that they are present
in the metastore.
 
I've also found this ticket:
https://issues.apache.org/jira/browse/SPARK-25185, but it has had no
response so far.
 
I suspect we must be missing something, as partitioned Parquet files seem
like a common use case, and if this were a bug in Spark I would have
expected it to be picked up sooner.

Has anybody managed to get CBO working with partitioned Parquet files? Is
this a known issue?
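
For concreteness, a sketch of the kind of check involved (Spark 2.3-style API;
the table, partition columns and filter value below are placeholders):

spark.conf.set("spark.sql.cbo.enabled", "true")

// table-level and partition-level statistics, as described above
spark.sql("ANALYZE TABLE db.table COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE db.table PARTITION (col1, col2) COMPUTE STATISTICS")

// inspect what the optimizer sees once a partition filter is applied
val stats = spark.sql("SELECT * FROM db.table WHERE col1 = 'x'")
  .queryExecution.optimizedPlan.stats
println(stats)   // sizeInBytes is present, but rowCount goes missing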
 
Thanks,
Emlyn



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to make pyspark use custom python?

2018-09-06 Thread Patrick McCarthy
It looks like, for whatever reason, your cluster isn't using the Python you
distributed, or that distribution doesn't contain what you think it does.

I've used the following with success to deploy a conda environment to my
cluster at runtime:
https://henning.kropponline.de/2016/09/24/running-pyspark-with-conda-env/
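
The gist of that approach, as a rough sketch for a YARN-backed cluster (the
archive name, alias and interpreter path below are placeholders, and building
the archive with something like conda-pack is assumed):

import os
from pyspark.sql import SparkSession

# point the Python workers at the interpreter inside the shipped archive;
# the path is relative to the unpacked archive on each executor
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (SparkSession.builder
         .appName("conda-env-demo")
         # "#environment" gives the unpacked archive a stable directory name
         .config("spark.yarn.dist.archives", "pyspark_conda_env.tar.gz#environment")
         .getOrCreate())

# pyarrow should now be importable on the executors
print(spark.sparkContext.parallelize([0], 1)
      .map(lambda _: __import__("pyarrow").__version__)
      .collect())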

On Thu, Sep 6, 2018 at 2:58 AM, Hyukjin Kwon  wrote:

> Are you doubly sure if it is an issue in Spark? I used custom python
> several times with setting it in PYSPARK_PYTHON before and it was no
> problem.
>
> On Thu, Sep 6, 2018 at 2:21 PM, mithril wrote:
>
>> For better looking , please see
>> https://stackoverflow.com/questions/52178406/howto-make-
>> pyspark-use-custom-python
>> > pyspark-use-custom-python>
>>
>> --
>>
>>
>> I am using zeppelin connect remote spark cluster.
>>
>> remote spark is using system python 2.7 .
>>
>> I want to switch to miniconda3, install a lib pyarrow.
>> What I do is :
>>
>> 1. Download miniconda3, install some libs, scp miniconda3 folder to spark
>> master and slaves.
>> 2. adding `PYSPARK_PYTHON="/usr/local/miniconda3/bin/python"` to
>> `spark-env.sh` in spark master and slaves.
>> 3. restart spark and zeppelin
>> 4. Running code
>>
>> %spark.pyspark
>>
>> import pandas as pd
>> from pyspark.sql.functions import pandas_udf,PandasUDFType
>>
>>
>> @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
>> def process_order_items(pdf):
>>
>> pdf.loc[:, 'total_price'] = pdf['price'] * pdf['count']
>>
>> d = {'has_discount':'count',
>> 'clearance':'count',
>> 'count': ['count', 'sum'],
>> 'price_guide':'max',
>> 'total_price': 'sum'
>>
>> }
>>
>> pdf1 = pdf.groupby('day').agg(d)
>> pdf1.columns = pdf1.columns.map('_'.join)
>> d1 = {'has_discount_count':'discount_order_count',
>> 'clearance_count':'clearance_order_count',
>> 'count_count':'order_count',
>> 'count_sum':'sale_count',
>> 'price_guide_max':'price_guide',
>> 'total_price_sum': 'total_price'
>> }
>>
>> pdf2 = pdf1.rename(columns=d1)
>>
>> pdf2.loc[:, 'discount_sale_count'] =
>> pdf.loc[pdf.has_discount>0,
>> 'count'].resample(freq).sum()
>> pdf2.loc[:, 'clearance_sale_count'] = pdf.loc[pdf.clearance>0,
>> 'count'].resample(freq).sum()
>> pdf2.loc[:, 'price'] = pdf2.total_price / pdf2.sale_count
>>
>> pdf2 = pdf2.drop(pdf2[pdf2.order_count == 0].index)
>>
>> return pdf2
>>
>>
>> results = df.groupby("store_id",
>> "product_id").apply(process_order_items)
>>
>> results.select(['store_id', 'price']).show(5)
>>
>>
>> Got error :
>>
>> Py4JJavaError: An error occurred while calling o172.showString.
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in
>> stage 6.0 (TID 143, 10.104.33.18, executor 2):
>> org.apache.spark.api.python.PythonException: Traceback (most recent call
>> last):
>>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py",
>> line
>> 230, in main
>> process()
>>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py",
>> line
>> 225, in process
>> serializer.dump_stream(func(split_index, iterator), outfile)
>>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py",
>> line
>> 150, in 
>> func = lambda _, it: map(mapper, it)
>>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/
>> serializers.py",
>> line 276, in load_stream
>> import pyarrow as pa
>> ImportError: No module named pyarrow
>>
>>
>> `10.104.33.18` is spark master,  so I think the `PYSPARK_PYTHON` is not
>> set
>> correctly .
>>
>> `pyspark`
>>
>> I login to master and slaves, run `pyspark interpreter` in each, and found
>> `import pyarrow` do not throw exception .
>>
>>
>> PS: `pyarrow` also installed in the machine which running zeppelin.
>>
>> --
>>
>> More info:
>>
>>
>> 1. spark cluster is installed in A, B, C , zeppelin is installed in D.
>> 2. `PYSPARK_PYTHON` is set in `spark-env.sh` in each A, B, C
>> 3. `import pyarrow` is fine with `/usr/local/spark/bin/pyspark` in A, B
>> ,C /
>> 4. `import pyarrow` is fine on A, B ,C custom python(miniconda3)
>> 5. `import pyarrow` is fine on D's default python(miniconda3, path is
>> different with A, B ,C , but it is doesn't matter)
>>
>>
>>
>> So I completely coundn't understand why it doesn't work.
>>
>>
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: getting error: value toDF is not a member of Seq[columns]

2018-09-06 Thread Mich Talebzadeh
OK, somehow this worked!

 // Save prices to mongoDB collection
 val document = sparkContext.parallelize((1 to 1).
map(i =>
Document.parse(s"{key:'$key',ticker:'$ticker',timeissued:'$timeissued',price:$price,CURRENCY:'$CURRENCY',op_type:$op_type,op_time:'$op_time'}")))
 //
 // Writing document to Mongo collection
 //
 MongoSpark.save(document, writeConfig)

Note that all non-numeric columns are enclosed in quotes, as '$column'.

I just created a dummy map with a single mapping (1 to 1).

These are the resulting documents in MongoDB:

{
"_id" : ObjectId("5b915796d3c6071e82fdca2b"),
"key" : "23c39917-08a9-4845-ba74-51997707d374",
"ticker" : "IBM",
"timeissued" : "2018-09-06T17:51:21",
"price" : 207.23,
"CURRENCY" : "GBP",
"op_type" : NumberInt(1),
"op_time" : "1536251798114"
}
{
"_id" : ObjectId("5b915796d3c6071e82fdca2c"),
"key" : "22f353f9-9b28-463c-9f1c-64213ded7cd5",
"ticker" : "TSCO",
"timeissued" : "2018-09-06T17:51:21",
"price" : 179.52,
"CURRENCY" : "GBP",
"op_type" : NumberInt(1),
"op_time" : "1536251798162"
}
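
For completeness, a hedged sketch of the DataFrame route attempted earlier,
using the MongoSpark.save(df, writeConfig) call already mentioned above. It
assumes spark is the SparkSession and writeConfig the existing WriteConfig; the
case class name is a placeholder and the values mirror the first document:

import com.mongodb.spark.MongoSpark
import spark.implicits._

case class Quote(key: String, ticker: String, timeissued: String,
                 price: Double, CURRENCY: String, op_type: Int, op_time: String)

// one-row DataFrame carrying the same fields as the parsed Document above
val df = Seq(Quote("23c39917-08a9-4845-ba74-51997707d374", "IBM",
                   "2018-09-06T17:51:21", 207.23, "GBP", 1, "1536251798114")).toDF
MongoSpark.save(df, writeConfig)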


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Sep 2018 at 10:24, Mich Talebzadeh 
wrote:

> thanks if you define columns class as below
>
>
> scala> case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Double)
> defined class columns
> scala> var df = Seq(columns("key", "ticker", "timeissued", 1.23f)).toDF
> df: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
> more fields]
> scala> df.printSchema
> root
>  |-- KEY: string (nullable = true)
>  |-- TICKER: string (nullable = true)
>  |-- TIMEISSUED: string (nullable = true)
>  |-- PRICE: double (nullable = false)
>
> looks better
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 6 Sep 2018 at 10:10, Jungtaek Lim  wrote:
>
>> This code works with Spark 2.3.0 via spark-shell.
>>
>> scala> case class columns(KEY: String, TICKER: String, TIMEISSUED:
>> String, PRICE: Float)
>> defined class columns
>>
>> scala> import spark.implicits._
>> import spark.implicits._
>>
>> scala> var df = Seq(columns("key", "ticker", "timeissued", 1.23f)).toDF
>> 18/09/06 18:02:23 WARN ObjectStore: Failed to get database global_temp,
>> returning NoSuchObjectException
>> df: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
>> more fields]
>>
>> scala> df
>> res0: org.apache.spark.sql.DataFrame = [KEY: string, TICKER: string ... 2
>> more fields]
>>
>> Maybe need to know about actual type of key, ticker, timeissued, price
>> from your variables.
>>
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Thu, Sep 6, 2018 at 5:57 PM, Mich Talebzadeh wrote:
>>
>>> I am trying to understand why spark cannot convert a simple comma
>>> separated columns as DF.
>>>
>>> I did a test
>>>
>>> I took one line of print and stored it as a one liner csv file as below
>>>
>>> var allInOne = key+","+ticker+","+timeissued+","+price
>>> println(allInOne)
>>>
>>> cat crap.csv
>>> 6e84b11d-cb03-44c0-aab6-37e06e06c996,MRW,2018-09-06T09:35:53,275.45
>>>
>>> Then after storing it in HDFS, I read that file as below
>>>
>>> import org.apache.spark.sql.functions._
>>> val location="hdfs://rhes75:9000/tmp/crap.csv"
>>> val df1 = spark.read.option("header", false).csv(location)
>>> case class columns(KEY: String, TICKER: String, TIMEISSUED: String,
>>> PRICE: Double)
>>> val df2 = df1.map(p => columns(p(0).toString,p(1).toString,
>>> p(2).toString,p(3).toString.toDouble))
>>> df2.printSchema
>>>
>>> This is the result I get
>>>
>>> df1: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2
>>> more fields]
>>> defined class columns
>>> df2: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
>>> string ... 2 more fields]
>>> root
>>>  |-- KEY: string (nullable = true)
>>>  |-- TICKER: string (nullabl

Error in show()

2018-09-06 Thread dimitris plakas
Hello everyone, I am new to PySpark and I am facing an issue. Let me
explain exactly what the problem is.

I have a dataframe and I apply a map() function on it:
dataframe2 = dataframe1.rdd.map(custom_function)
dataframe = sqlContext.createDataFrame(dataframe2)

When I call

dataframe.show(30, True) it shows the result,

but when I call dataframe.show(60, True) I get an error. The error is in
the attachment Pyspark_Error.txt.

Could you please explain what this error is and how to get past it?
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
[Stage 13:===>(23 + 8) / 
47]2018-09-06 22:16:07 ERROR PythonRunner:91 - Python worker exited 
unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"C:\Users\dimitris\PycharmProjects\untitled\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py",
 line 214, in main
  File 
"C:\Users\dimitris\PycharmProjects\untitled\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py",
 line 685, in read_int
raise EOFError
EOFError

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:428)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.api.python.SerDeUtil$Auto

Re: How to make pyspark use custom python?

2018-09-06 Thread mithril


The whole content of `spark-env.sh` is:

```
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=10.104.85.78:2181,10.104.114.131:2181,10.135.2.132:2181
-Dspark.deploy.zookeeper.dir=/spark"
PYSPARK_PYTHON="/usr/local/miniconda3/bin/python"
```

I ran `/usr/local/spark/sbin/stop-all.sh` and
`/usr/local/spark/sbin/start-all.sh` to restart the Spark cluster.

Is anything wrong here?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [External Sender] Re: How to make pyspark use custom python?

2018-09-06 Thread Femi Anthony
Are you sure that pyarrow is deployed on your slave hosts? If not, you
will either have to get it installed, or ship it along when you call
spark-submit by zipping it up and specifying the zip file to be shipped
with the --py-files zipfile.zip option.

A quick check would be to ssh to a slave host, run pyspark and try to
import pyarrow.
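
Another quick check that doesn't need ssh is to ask the executors directly
which interpreter they are running; a small sketch, assuming a live
SparkSession named spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def worker_python(_):
    # runs on the executors; reports the Python binary each worker uses
    import sys
    return sys.executable

print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(worker_python)
      .distinct()
      .collect())
# expected: ['/usr/local/miniconda3/bin/python'] if PYSPARK_PYTHON took effect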

Femi

On Thu, Sep 6, 2018 at 9:25 PM mithril  wrote:

>
> The whole content in `spark-env.sh` is
>
> ```
> SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
> -Dspark.deploy.zookeeper.url=10.104.85.78:2181,10.104.114.131:2181,
> 10.135.2.132:2181
> -Dspark.deploy.zookeeper.dir=/spark"
> PYSPARK_PYTHON="/usr/local/miniconda3/bin/python"
> ```
>
> I ran `/usr/local/spark/sbin/stop-all.sh`  and
> `/usr/local/spark/sbin/start-all.sh` to restart spark cluster.
>
> Anything wrong ??
>
>
>
> --
> Sent from:
> https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Dspark-2Duser-2Dlist.1001560.n3.nabble.com_&d=DwICAg&c=pLULRYW__RtkwsQUPxJVDGboCTdgji3AcHNJU0BpTJE&r=yGeUxkUZBNPLfjlLWOxq5_p1UIOy_S4ghJsg2_iDHFY&m=MukYKwEikKwBiW7D3pP5WDVQCs39Xo8dHytUwL1JjLM&s=5Bta_aRxRPJk58UXz-hQd7A1EzF-PX3A5C3vENHe3OQ&e=
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>




Re: Error in show()

2018-09-06 Thread Apostolos N. Papadopoulos
Can you isolate the row that is causing the problem? I mean, start with
show(31) and work your way up to show(60).


Perhaps this will help you to understand the problem.
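
A rough way to bisect for it, assuming the converted DataFrame is called
dataframe (row order is not guaranteed, so treat the index as approximate):

for n in range(31, 61):
    try:
        dataframe.limit(n).collect()   # materialise the first n rows
    except Exception as exc:           # the Python worker crash surfaces here
        print("failure first appears when %d rows are pulled: %s" % (n, exc))
        break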

regards,

Apostolos



On 07/09/2018 01:11 AM, dimitris plakas wrote:
Hello everyone, I am new in Pyspark and i am facing an issue. Let me 
explain what exactly is the problem.


I have a dataframe and i apply on this a map() function 
(dataframe2=datframe1.rdd.map(custom_function())

dataframe = sqlContext.createDataframe(dataframe2)

when i have

dataframe.show(30,True) it shows the result,

when i am using dataframe.show(60, True) i get the error. The Error is 
in the attachement Pyspark_Error.txt.


Could you please explain me what is this error and how to overpass it?



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol