Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
On Sun, Dec 18, 2016 at 2:26 AM, vaquar khan  wrote:

> select * from indexInfo;
>

Hi Vaquar,
I do not see a CF with the name indexInfo in any of the Cassandra databases.

Thanks,
Deepak


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-17 Thread Divya Gehlot
I am not a PySpark person, but from the errors I can tell that your Spark
application is having memory issues.
Are you collecting the results to the driver at any point, or have you
configured too little memory for the nodes?
Also, if you are using DataFrames, there is an issue raised in JIRA.


Hope this helps

Thanks,
Divya

On 16 December 2016 at 16:53, Russell Jurney 
wrote:

> I have created a PySpark Streaming application that uses Spark ML to
> classify flight delays into three categories: on-time, slightly late, very
> late. After an hour or so something times out and the whole thing crashes.
>
> The code and error are on a gist here: https://gist.github.com/rjurney/
> 17d471bc98fd1ec925c37d141017640d
>
> While I am interested in why I am getting an exception, I am more
> interested in understanding what the correct deployment model is... because
> long running processes will have new and varied errors and exceptions.
> Right now, with what I've built, Spark is a highly dependable distributed
> system, but in streaming mode the whole application hinges on a single
> Python PID staying up. This can't be how apps are deployed in the wild,
> because it would never be very reliable, right? But I don't see anything
> about this in the docs, so I am confused.
>
> Note that I use this to run the app, maybe that is the problem?
>
> ssc.start()
> ssc.awaitTermination()
>
>
> What is the actual deployment model for Spark Streaming? All I know to do
> right now is to restart the PID. I'm new to Spark, and the docs don't
> really explain this (that I can see).
>
> Thanks!
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>
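On the deployment question itself: the usual pattern for long-running streaming drivers is to combine checkpoint-based recovery (StreamingContext.getOrCreate) with a cluster manager that restarts the driver process, e.g. spark-submit --supervise in standalone cluster mode or YARN's application-attempt retries. A minimal Java sketch of that shape, with a hypothetical checkpoint directory and placeholder batch logic, could look like this:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RestartableStreamingApp {
    // Hypothetical checkpoint location; any fault-tolerant store (HDFS, S3) works.
    private static final String CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint";

    private static JavaStreamingContext createContext() {
        SparkConf conf = new SparkConf().setAppName("RestartableStreamingApp");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ssc.checkpoint(CHECKPOINT_DIR);
        // Define the input streams and transformations here, before returning.
        return ssc;
    }

    public static void main(String[] args) throws Exception {
        // Rebuild the context from the checkpoint if one exists, otherwise create it fresh.
        JavaStreamingContext ssc =
                JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, RestartableStreamingApp::createContext);
        ssc.start();
        ssc.awaitTermination();
    }
}

The driver process itself still has to be restarted by something (the --supervise flag, YARN's attempt retries, or an external supervisor); the checkpoint only lets the restarted driver pick up where it left off.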


Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-17 Thread Russell Jurney
Anyone? This is for a book, so I need to figure this out.

On Fri, Dec 16, 2016 at 12:53 AM Russell Jurney 
wrote:

> I have created a PySpark Streaming application that uses Spark ML to
> classify flight delays into three categories: on-time, slightly late, very
> late. After an hour or so something times out and the whole thing crashes.
>
> The code and error are on a gist here:
> https://gist.github.com/rjurney/17d471bc98fd1ec925c37d141017640d
>
> While I am interested in why I am getting an exception, I am more
> interested in understanding what the correct deployment model is... because
> long running processes will have new and varied errors and exceptions.
> Right now, with what I've built, Spark is a highly dependable distributed
> system, but in streaming mode the whole application hinges on a single
> Python PID staying up. This can't be how apps are deployed in the wild,
> because it would never be very reliable, right? But I don't see anything
> about this in the docs, so I am confused.
>
> Note that I use this to run the app, maybe that is the problem?
>
> ssc.start()
> ssc.awaitTermination()
>
>
> What is the actual deployment model for Spark Streaming? All I know to do
> right now is to restart the PID. I'm new to Spark, and the docs don't
> really explain this (that I can see).
>
> Thanks!
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>
>
>


Re: Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
Super, that works! Thanks


Sent from Yahoo Mail for iPhone


On Sunday, December 18, 2016, 11:28 AM, Yong Zhang  wrote:

Why don't you just return the struct you defined, instead of an array?




            @Override
            public Row call(Double x, Double y) throws Exception {
                Row row = RowFactory.create(x, y);
                return row;
            }



From: Richard Xin 
Sent: Saturday, December 17, 2016 8:53 PM
To: Yong Zhang; zjp_j...@163.com; user
Subject: Re: Java to show struct field from a Dataframe

I tried to transform
root
 |-- latitude: double (nullable = false)
 |-- longitude: double (nullable = false)
 |-- name: string (nullable = true)
to: 
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
Code snippet is as followings:
        sqlContext.udf().register("toLocation", new UDF2<Double, Double, Row>() {
            @Override
            public Row call(Double x, Double y) throws Exception {
                Row row = RowFactory.create(new double[] { x, y });
                return row;
            }
        }, DataTypes.createStructType(new StructField[] { 
                new StructField("longitude", DataTypes.DoubleType, true, 
Metadata.empty()),
                new StructField("latitude", DataTypes.DoubleType, true, 
Metadata.empty())
            }));
        DataFrame transformedDf1 = citiesDF.withColumn("location",
                callUDF("toLocation", col("longitude"), col("latitude")));
        
transformedDf1.drop("latitude").drop("longitude").schema().printTreeString();  
// prints schema tree OK as expected
transformedDf.show();  // java.lang.ClassCastException: [D cannot be 
cast to java.lang.Double

It seems to me that the return type of the UDF2 might be the root cause, but I am
not sure how to correct it.

Thanks,
Richard




On Sunday, December 18, 2016 7:15 AM, Yong Zhang  wrote:


"[D" type means a double array type. So this error simply means you have
double[] data, but Spark needs to cast it to Double, as your schema defines.
The error message clearly indicates that the data doesn't match the type
specified in the schema.
I wonder how you are so sure about your data. Did you check it with another tool?
Yong

From: Richard Xin 
Sent: Saturday, December 17, 2016 10:56 AM
To: zjp_j...@163.com; user
Subject: Re: Java to show struct field from a Dataframe

data is good


On Saturday, December 17, 2016 11:50 PM, "zjp_j...@163.com"  
wrote:


I think the cause is invalid Double data. Have you checked your data?
zjp_j...@163.com
From: Richard Xin
Date: 2016-12-17 23:28
To: User
Subject: Java to show struct field from a Dataframe

Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
    at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)

Any advice?
Thanks in advance.
Richard






 



Re: Java to show struct field from a Dataframe

2016-12-17 Thread Yong Zhang
Why don't you just return the struct you defined, instead of an array?


@Override
public Row call(Double x, Double y) throws Exception {
    Row row = RowFactory.create(x, y);
    return row;
}
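Put together, a corrected version of the earlier snippet along these lines might look like the sketch below (it assumes the same sqlContext, citiesDF, and column names as the original; the only substantive change is passing x and y directly to RowFactory.create so they line up with the two struct fields):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

sqlContext.udf().register("toLocation", new UDF2<Double, Double, Row>() {
    @Override
    public Row call(Double x, Double y) throws Exception {
        // Two separate values, matching the two fields of the struct type below.
        return RowFactory.create(x, y);
    }
}, DataTypes.createStructType(new StructField[] {
        new StructField("longitude", DataTypes.DoubleType, true, Metadata.empty()),
        new StructField("latitude", DataTypes.DoubleType, true, Metadata.empty())
}));

DataFrame withLocation = citiesDF
        .withColumn("location", callUDF("toLocation", col("longitude"), col("latitude")))
        .drop("latitude")
        .drop("longitude");
withLocation.show();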



From: Richard Xin 
Sent: Saturday, December 17, 2016 8:53 PM
To: Yong Zhang; zjp_j...@163.com; user
Subject: Re: Java to show struct field from a Dataframe

I tried to transform
root
 |-- latitude: double (nullable = false)
 |-- longitude: double (nullable = false)
 |-- name: string (nullable = true)

to:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)

Code snippet is as followings:

sqlContext.udf().register("toLocation", new UDF2<Double, Double, Row>() {
    @Override
    public Row call(Double x, Double y) throws Exception {
        Row row = RowFactory.create(new double[] { x, y });
        return row;
    }
}, DataTypes.createStructType(new StructField[] {
    new StructField("longitude", DataTypes.DoubleType, true, Metadata.empty()),
    new StructField("latitude", DataTypes.DoubleType, true, Metadata.empty())
}));

DataFrame transformedDf1 = citiesDF.withColumn("location",
callUDF("toLocation", col("longitude"), col("latitude")));

transformedDf1.drop("latitude").drop("longitude").schema().printTreeString();  
// prints schema tree OK as expected

transformedDf.show();  // java.lang.ClassCastException: [D cannot be 
cast to java.lang.Double


It seems to me that the return type of the UDF2 might be the root cause, but I am
not sure how to correct it.

Thanks,
Richard




On Sunday, December 18, 2016 7:15 AM, Yong Zhang  wrote:


"[D" type means a double array type. So this error simply means you have
double[] data, but Spark needs to cast it to Double, as your schema defines.

The error message clearly indicates that the data doesn't match the type
specified in the schema.

I wonder how you are so sure about your data. Did you check it with another tool?

Yong



From: Richard Xin 
Sent: Saturday, December 17, 2016 10:56 AM
To: zjp_j...@163.com; user
Subject: Re: Java to show struct field from a Dataframe

data is good


On Saturday, December 17, 2016 11:50 PM, "zjp_j...@163.com"  
wrote:


I think the cause is invalid Double data. Have you checked your data?


zjp_j...@163.com

From: Richard Xin
Date: 2016-12-17 23:28
To: User
Subject: Java to show struct field from a Dataframe
Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)

df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)


Any advice?
Thanks in advance.
Richard






Kafka Spark structured streaming latency benchmark.

2016-12-17 Thread Prashant Sharma
Hi,

The goal of my benchmark is to arrive at an end-to-end latency lower than 100ms
and sustain it over time, by consuming from a Kafka topic and writing
back to another Kafka topic using Spark. Since the job does no
aggregation and does constant-time processing on each message, this
appeared to me to be an achievable target. But then there are some surprising
and interesting patterns to observe.

Basically, it has four components:
1) kafka
2) Long running kafka producer, rate limited to 1000 msgs/sec, with each
message of about 1KB.
3) Spark  job subscribed to `test` topic and writes out to another topic
`output`.
4) A Kafka consumer, reading from the `output` topic.

How was the latency measured?

While sending messages from the Kafka producer, each message embeds the
timestamp at which it is pushed to the Kafka `test` topic. Spark receives
each message and writes it out to the `output` topic as is. When these
messages arrive at the Kafka consumer, their embedded time is subtracted from
the time of arrival at the consumer, and a scatter plot of the result is
attached.
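For reference, that measurement can be reproduced with the plain Kafka Java client; the sketch below (bootstrap servers and topic names are placeholders, and it assumes a reasonably recent kafka-clients dependency) prefixes each payload with the send timestamp and lets the consumer compute the end-to-end delay:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyCheck {

    // Producer side: embed the send timestamp at the front of each payload.
    static void send(KafkaProducer<String, String> producer, String topic, String payload) {
        producer.send(new ProducerRecord<>(topic, System.currentTimeMillis() + "|" + payload));
    }

    // Consumer side: subtract the embedded timestamp from the arrival time.
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "latency-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("output"));   // topic the Spark job writes to
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    long sentAt = Long.parseLong(record.value().split("\\|", 2)[0]);
                    System.out.println("end-to-end latency ms: " + (System.currentTimeMillis() - sentAt));
                }
            }
        }
    }
}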

The scatter plots sample only 10 minutes of data received during the initial
hour, and then another 10 minutes of data received after 2 hours of running.



These plots indicate a significant slowdown in latency: the later
scatter plot shows that almost all messages were received with a delay
larger than 2 seconds, whereas the first plot shows that most messages arrived
with less than 100ms latency. The two samples were taken approximately
2 hours apart.

After running the test for 24 hours, the jstat and jmap output for the job
indicates the possibility of memory constraints. To be more clear, the job was run
with local[20] and 5GB of memory (spark.driver.memory). The job is straightforward
and is located here:
https://github.com/ScrapCodes/KafkaProducer/blob/master/src/main/scala/com/github/scrapcodes/kafka/SparkSQLKafkaConsumer.scala


What is causing the gradual slowdown? I need help in diagnosing the
problem.

Thanks,

--Prashant


Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
There are 8 worker nodes in the cluster .

Thanks
Deepak

On Dec 18, 2016 2:15 AM, "Holden Karau"  wrote:

> How many workers are in the cluster?
>
> On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma 
> wrote:
>
>> Hi All,
>> I am iterating over a data frame's partitions using df.foreachPartition.
>> Upon each iteration of a row, I am initializing a DAO to insert the row into
>> Cassandra.
>> Each of these iterations takes almost one and a half minutes to finish.
>> In my workflow, this is part of an action, and 100 partitions are being
>> created for the df, as I can see 100 tasks being created where the insert
>> DAO operation is being performed.
>> Since each of these 100 tasks takes around one and a half minutes to
>> complete, it takes around 2 hours for this small insert operation.
>> Is anyone facing the same scenario, and is there any time-efficient way to
>> handle this?
>> This latency is not good in our use case.
>> Any pointer to improve/minimise the latency will be really appreciated.
>>
>>
>> --
>> Thanks
>> Deepak
>>
>>
>>


Re: Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
I tried to transform 
root
 |-- latitude: double (nullable = false)
 |-- longitude: double (nullable = false)
 |-- name: string (nullable = true)
to: 
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
Code snippet is as followings:
        sqlContext.udf().register("toLocation", new UDF2<Double, Double, Row>() {
            @Override
            public Row call(Double x, Double y) throws Exception {
                Row row = RowFactory.create(new double[] { x, y });
                return row;
            }
        }, DataTypes.createStructType(new StructField[] { 
                new StructField("longitude", DataTypes.DoubleType, true, 
Metadata.empty()),
                new StructField("latitude", DataTypes.DoubleType, true, 
Metadata.empty()) 
            }));
        DataFrame transformedDf1 = citiesDF.withColumn("location",
                callUDF("toLocation", col("longitude"), col("latitude")));
        
transformedDf1.drop("latitude").drop("longitude").schema().printTreeString();  
// prints schema tree OK as expected
transformedDf.show();  // java.lang.ClassCastException: [D cannot be 
cast to java.lang.Double

It seems to me that the return type of the UDF2 might be the root cause, but I am
not sure how to correct it.

Thanks,
Richard


 

On Sunday, December 18, 2016 7:15 AM, Yong Zhang  
wrote:
 

"[D" type means a double array type. So this error simply means you have
double[] data, but Spark needs to cast it to Double, as your schema defines.
The error message clearly indicates that the data doesn't match the type
specified in the schema.
I wonder how you are so sure about your data. Did you check it with another tool?
Yong

From: Richard Xin 
Sent: Saturday, December 17, 2016 10:56 AM
To: zjp_j...@163.com; user
Subject: Re: Java to show struct field from a Dataframe

data is good


On Saturday, December 17, 2016 11:50 PM, "zjp_j...@163.com"  
wrote:


I think the cause is invalid Double data. Have you checked your data?
zjp_j...@163.com
From: Richard Xin
Date: 2016-12-17 23:28
To: User
Subject: Java to show struct field from a Dataframe

Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
    at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)

Any advice?
Thanks in advance.
Richard





   

Re: Running Multiple Versions of Spark on the same cluster (YARN)

2016-12-17 Thread Koert Kuipers
spark only needs to be present on the machine that launches it using
spark-submit

On Sat, Dec 17, 2016 at 3:59 PM, Jorge Machado  wrote:

> Hi Tiago,
>
> Thanks for the update. Last question: does the spark-submit that you are
> using need to be the same version on all YARN hosts?
> Regards
>
> Jorge Machado
>
>
>
>
>
> On 17 Dec 2016, at 16:46, Tiago Albineli Motta  wrote:
>
> Hi Jorge,
>
> Here we are using an Apache Hadoop installation, and to run multiple
> versions we just need to change the submit in the client to use the correct
> Spark version you need.
>
> $SPARK_HOME/bin/spark-submit
>
> and pass the correct Spark libs in the conf.
>
> For spark 2.0.0
>
> --conf spark.yarn.archive=
>
> For previously versions
>
> --conf spark.yarn.jar=
>
>
>
> Tiago
>
>
>
> Tiago Albineli Motta
> Desenvolvedor de Software - Globo.com
> ICQ: 32107100
> http://programandosemcafeina.blogspot.com
>
> On Sat, Dec 17, 2016 at 5:36 AM, Jorge Machado  wrote:
>
>> Hi Everyone,
>>
>> I have one question: is it possible to run on HDP with Spark 1.6.1 and
>> then run Spark 2.0.0 inside of it?
>> Like passing the Spark libs with --jars? The idea behind it is not to
>> need to use the default installation of HDP and to be able to deploy new
>> versions of Spark quickly.
>>
>> Thx
>>
>> Jorge Machado
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Java to show struct field from a Dataframe

2016-12-17 Thread Yong Zhang
"[D" type means a double array type. So this error simply means you have
double[] data, but Spark needs to cast it to Double, as your schema defines.


The error message clearly indicates that the data doesn't match the type
specified in the schema.


I wonder how you are so sure about your data. Did you check it with another tool?


Yong



From: Richard Xin 
Sent: Saturday, December 17, 2016 10:56 AM
To: zjp_j...@163.com; user
Subject: Re: Java to show struct field from a Dataframe

data is good


On Saturday, December 17, 2016 11:50 PM, "zjp_j...@163.com"  
wrote:


I think the cause is invalid Double data. Have you checked your data?


zjp_j...@163.com

From: Richard Xin
Date: 2016-12-17 23:28
To: User
Subject: Java to show struct field from a Dataframe
Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)

df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)


Any advice?
Thanks in advance.
Richard




Re: Regarding Connection Problem

2016-12-17 Thread Luciano Resende
On Fri, Dec 16, 2016 at 7:01 PM, Chintan Bhatt <
chintanbhatt...@charusat.ac.in> wrote:

Hi,
> I want to take the continuous output (avg. temperature) generated from node.js,
> store it on Hadoop, and then retrieve it for visualization.
> Please guide me on how to send the continuous output of node.js to Kafka.
>
> --
> CHINTAN BHATT 
>
>
I haven't tried it with plain Kafka, but if you are using the Confluent
distribution, it provides a REST API on top of Kafka.

Also, if you have the flexibility to use MQTT, then you can see an example
here
https://github.com/lresende/bahir-iot-demo


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Running Multiple Versions of Spark on the same cluster (YARN)

2016-12-17 Thread Jorge Machado
Hi Tiago, 

Thanks for the update. Last question: does the spark-submit that you are using
need to be the same version on all YARN hosts?
Regards

Jorge Machado





> On 17 Dec 2016, at 16:46, Tiago Albineli Motta  wrote:
> 
> Hi Jorge,
> 
> Here we are using an Apache Hadoop installation, and to run multiple versions
> we just need to change the submit in the client to use the correct Spark
> version you need.
> 
> $SPARK_HOME/bin/spark-submit
> 
> and pass the correct Spark libs in the conf.
> 
> For spark 2.0.0
> 
> --conf spark.yarn.archive=
> 
> For previously versions
> 
> --conf spark.yarn.jar=
> 
> 
> 
> Tiago
> 
> 
> 
> Tiago Albineli Motta
> Desenvolvedor de Software - Globo.com
> ICQ: 32107100
> http://programandosemcafeina.blogspot.com 
> 
> 
> On Sat, Dec 17, 2016 at 5:36 AM, Jorge Machado  > wrote:
> Hi Everyone, 
> 
> I have one question: is it possible to run on HDP with Spark 1.6.1 and then
> run Spark 2.0.0 inside of it?
> Like passing the Spark libs with --jars? The idea behind it is not to need
> to use the default installation of HDP and to be able to deploy new versions of
> Spark quickly.
> 
> Thx
> 
> Jorge Machado
> 
> 
> 
> 
> 
> 
> 



Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread vaquar khan
Hi Deepak,

Could you share the index information in your database?

select * from indexInfo;


Regards,
Vaquar khan

On Sat, Dec 17, 2016 at 2:45 PM, Holden Karau  wrote:

> How many workers are in the cluster?
>
> On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma 
> wrote:
>
>> Hi All,
>> I am iterating over a data frame's partitions using df.foreachPartition.
>> Upon each iteration of a row, I am initializing a DAO to insert the row into
>> Cassandra.
>> Each of these iterations takes almost one and a half minutes to finish.
>> In my workflow, this is part of an action, and 100 partitions are being
>> created for the df, as I can see 100 tasks being created where the insert
>> DAO operation is being performed.
>> Since each of these 100 tasks takes around one and a half minutes to
>> complete, it takes around 2 hours for this small insert operation.
>> Is anyone facing the same scenario, and is there any time-efficient way to
>> handle this?
>> This latency is not good in our use case.
>> Any pointer to improve/minimise the latency will be really appreciated.
>>
>>
>> --
>> Thanks
>> Deepak
>>
>>
>>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Holden Karau
How many workers are in the cluster?

On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma 
wrote:

> Hi All,
> I am iterating over a data frame's partitions using df.foreachPartition.
> Upon each iteration of a row, I am initializing a DAO to insert the row into
> Cassandra.
> Each of these iterations takes almost one and a half minutes to finish.
> In my workflow, this is part of an action, and 100 partitions are being
> created for the df, as I can see 100 tasks being created where the insert
> DAO operation is being performed.
> Since each of these 100 tasks takes around one and a half minutes to complete,
> it takes around 2 hours for this small insert operation.
> Is anyone facing the same scenario, and is there any time-efficient way to
> handle this?
> This latency is not good in our use case.
> Any pointer to improve/minimise the latency will be really appreciated.
>
>
> --
> Thanks
> Deepak
>
>
>


foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
Hi All,
I am iterating over a data frame's partitions using df.foreachPartition.
Upon each iteration of a row, I am initializing a DAO to insert the row into
Cassandra.
Each of these iterations takes almost one and a half minutes to finish.
In my workflow, this is part of an action, and 100 partitions are being
created for the df, as I can see 100 tasks being created where the insert
DAO operation is being performed.
Since each of these 100 tasks takes around one and a half minutes to
complete, it takes around 2 hours for this small insert operation.
Is anyone facing the same scenario, and is there any time-efficient way to
handle this?
This latency is not good in our use case.
Any pointer to improve/minimise the latency will be really appreciated.


-- 
Thanks
Deepak
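One thing worth checking in a setup like this is where the DAO (and its underlying connection) is created: opening it once per partition, rather than once per row, usually removes most of the per-row overhead. A hedged Java sketch of that shape, with a hypothetical CassandraDao helper standing in for the real DAO, looks like this:

import java.util.Iterator;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Row;

// df is the DataFrame from the description above; CassandraDao is hypothetical.
df.javaRDD().foreachPartition(new VoidFunction<Iterator<Row>>() {
    @Override
    public void call(Iterator<Row> rows) throws Exception {
        CassandraDao dao = new CassandraDao();   // open the connection once per partition
        try {
            while (rows.hasNext()) {
                dao.insert(rows.next());         // ideally batched or async inside the DAO
            }
        } finally {
            dao.close();
        }
    }
});

With 8 workers and 100 tasks of roughly 90 seconds each, total wall time also depends on how many tasks run concurrently (executor cores), so raising parallelism or batching the writes are the usual levers.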


Re: Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
data is good
 

On Saturday, December 17, 2016 11:50 PM, "zjp_j...@163.com" 
 wrote:
 

I think the cause is invalid Double data. Have you checked your data?
zjp_j...@163.com
From: Richard Xin
Date: 2016-12-17 23:28
To: User
Subject: Java to show struct field from a Dataframe

Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
    at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)

Any advice?
Thanks in advance.
Richard



   

Re: Java to show struct field from a Dataframe

2016-12-17 Thread zjp_j...@163.com
I think the cause is invalid Double data. Have you checked your data?



zjp_j...@163.com
 
From: Richard Xin
Date: 2016-12-17 23:28
To: User
Subject: Java to show struct field from a Dataframe
Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)

df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)


Any advice?
Thanks in advance.
Richard


Re: Running Multiple Versions of Spark on the same cluster (YARN)

2016-12-17 Thread Tiago Albineli Motta
Hi Jorge,

Here we are using an Apache Hadoop installation, and to run multiple
versions we just need to change the submit in the client to use the correct
Spark version you need.

$SPARK_HOME/bin/spark-submit

and pass the correct Spark libs in the conf.

For spark 2.0.0

--conf spark.yarn.archive=

For previously versions

--conf spark.yarn.jar=



Tiago



Tiago Albineli Motta
Desenvolvedor de Software - Globo.com
ICQ: 32107100
http://programandosemcafeina.blogspot.com

On Sat, Dec 17, 2016 at 5:36 AM, Jorge Machado  wrote:

> Hi Everyone,
>
> I have one question: is it possible to run on HDP with Spark 1.6.1 and
> then run Spark 2.0.0 inside of it?
> Like passing the Spark libs with --jars? The idea behind it is not to
> need to use the default installation of HDP and to be able to deploy new
> versions of Spark quickly.
>
> Thx
>
> Jorge Machado
>
>
>
>
>
>
>


Java to show struct field from a Dataframe

2016-12-17 Thread Richard Xin
Let's say I have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
df.show() throws the following exception:

java.lang.ClassCastException: [D cannot be cast to java.lang.Double
    at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
    at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
    at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getDouble(rows.scala:221)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)

Any advice?
Thanks in advance.
Richard


Can a Spark Hive UDF read broadcast variables?

2016-12-17 Thread 李斌松
Can a Spark Hive UDF read broadcast variables?


theory question

2016-12-17 Thread kant kodali
Given a set of transformations, does Spark create multiple DAGs and pick one
by some metric, such as a higher degree of concurrency or something else, as
the typical task-graph model in parallel computing suggests? Or does it simply
build one DAG by going through the transformations/tasks in order?

I am thinking that during the shuffle phase there can be many ways to shuffle,
which could result in multiple DAGs, with some DAG ultimately being chosen by
some metric. However, this is a complete guess; I am not sure what exactly
happens underneath Spark.

thanks!


Fwd: SparkLauncher does not return State/ID on a standalone cluster

2016-12-17 Thread Rahul Raj
I am unable to retrieve the state and Id of a submitted application on a
Standalone cluster. The job gets executed successfully on the cluster.

The state was checked using:

while (!handle.getState().isFinal()) {
    System.out.println(handle.getState());
}

When run in local mode, the state gets reported correctly.

Regards,
Rahul

-- 
 This email and any files transmitted with it are confidential and 
intended solely for the use of the individual or entity to whom it is 
addressed. If you are not the named addressee then you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately and delete this e-mail from your system.
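One variant worth trying is the listener-based launcher API (SparkLauncher.startApplication), which pushes state and application-ID updates to a callback instead of relying on polling; whether it reports full state for standalone cluster mode depends on the Spark version, so treat this as a sketch with placeholder values rather than a guaranteed fix:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherExample {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/path/to/app.jar")        // placeholder
                .setMainClass("com.example.Main")          // placeholder
                .setMaster("spark://master:7077")          // placeholder
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        System.out.println("state=" + h.getState() + " appId=" + h.getAppId());
                    }

                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        // Fired when, for example, the application ID becomes available.
                    }
                });

        // Keep the launcher JVM alive until the application reaches a final state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
    }
}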


Re: Dataset encoders for further types?

2016-12-17 Thread Michal Šenkýř
I actually already made a pull request adding support for arbitrary 
sequence types.


https://github.com/apache/spark/pull/16240

There is still a little problem of Seq.toDS not working for those types 
(couldn't get implicits with multiple type parameters to resolve 
correctly) but createDataset works fine.


I would be glad if you could bring some attention to it. It's my first
code-related pull request and no one has responded to it yet. I'm wondering
if I'm doing something wrong on that front.


Michal Senkyr


On 16.12.2016 12:04, Jakub Dubovsky wrote:

I will give that a try. Thanks!

On Fri, Dec 16, 2016 at 12:45 AM, Michael Armbrust 
> wrote:


I would have sworn there was a ticket, but I can't find it.  So
here you go: https://issues.apache.org/jira/browse/SPARK-18891


A workaround until that is fixed would be for you to manually
specify the kryo encoder.

On Thu, Dec 15, 2016 at 8:18 AM, Jakub Dubovsky
> wrote:

Hey,

I want to ask whether there is any roadmap/plan for adding
Encoders for further types in the next releases of Spark. Here is
a list of currently supported types. We would like to use Datasets with
our internally defined case classes containing
scala.collection.immutable.List(s). This does not work now
because these lists are converted to ArrayType (Seq). This
then fails a constructor lookup because of a seq-is-not-a-list
error...

This means that for now we are stuck with using RDDs.

Thanks for any insights!

Jakub Dubovsky
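The Kryo-encoder workaround mentioned above might look roughly like this in Java (MyRecord is a hypothetical stand-in for a type with no built-in encoder):

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class KryoEncoderExample {
    // Hypothetical stand-in for a class that Spark has no built-in encoder for.
    public static class MyRecord implements Serializable {
        public final List<String> tags;
        public MyRecord(List<String> tags) { this.tags = tags; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("kryo-encoder-example").master("local[*]").getOrCreate();

        // Fall back to a Kryo-serialized encoder; rows are stored as an opaque binary column.
        Encoder<MyRecord> encoder = Encoders.kryo(MyRecord.class);
        Dataset<MyRecord> ds = spark.createDataset(
                Arrays.asList(new MyRecord(Arrays.asList("a", "b"))), encoder);

        System.out.println(ds.count());
        spark.stop();
    }
}

The trade-off is that Kryo-encoded data is a single binary column, so Spark SQL cannot prune or filter on individual fields the way it can with fully mapped types.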







How to perform Join operation using JAVARDD

2016-12-17 Thread Sree Eedupuganti
I tried like this,

CrashData_1.csv:

CRASH_KEY  CRASH_NUMBER  CRASH_DATE  CRASH_MONTH
2016899114 2016899114  01/02/2016   12:00:00 AM +

CrashData_2.csv:

CITY_NAME  ZIPCODE  CITY  STATE
1945  704  PARC PARQUE  PR


Code:

JavaRDD<String> firstRDD =
    sc.textFile("/Users/apple/Desktop/CrashData_1.csv");

JavaRDD<String> secondRDD =
    sc.textFile("/Users/apple/Desktop/CrashData_2.csv");

JavaRDD<String> allRDD = firstRDD.union(secondRDD);


Output I am getting:

[CRASH_KEY,CRASH_NUMBER,CRASH_DATE,CRASH_MONTH,
2016899114,2016899114,01/02/2016 12:00:00 AM +

CITY_NAME,ZIPCODE,CITY,STATE, 1945,704,PARC PARQUE,PR]




Any suggestions please? Thanks in advance.
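If the goal is actually a join rather than a union, the two datasets need to share a key; the files above don't show an obvious common column, so the sketch below assumes, hypothetically, that the first CSV field of each file is the join key. It keys each JavaRDD into a JavaPairRDD and joins them:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Hypothetical sketch: assumes the first CSV field of each file is the join key;
// header rows, if present, simply will not match any key on the other side.
JavaRDD<String> crashLines = sc.textFile("/Users/apple/Desktop/CrashData_1.csv");
JavaRDD<String> cityLines = sc.textFile("/Users/apple/Desktop/CrashData_2.csv");

JavaPairRDD<String, String> crashesByKey = crashLines.mapToPair(line -> {
    String[] fields = line.split(",", 2);
    return new Tuple2<>(fields[0], fields[1]);   // key -> rest of the row
});
JavaPairRDD<String, String> citiesByKey = cityLines.mapToPair(line -> {
    String[] fields = line.split(",", 2);
    return new Tuple2<>(fields[0], fields[1]);
});

// Inner join on the key; each value is a pair of the two row remainders.
JavaPairRDD<String, Tuple2<String, String>> joined = crashesByKey.join(citiesByKey);
joined.take(10).forEach(t ->
    System.out.println(t._1() + " -> " + t._2()._1() + " | " + t._2()._2()));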


Re: need help to have a Java version of this scala script

2016-12-17 Thread Richard Xin
Thanks for pointing me in the right direction; I have figured out the way.
 

On Saturday, December 17, 2016 5:23 PM, Igor Berman  
wrote:
 

Do you mind showing what you have in Java? In general, $"bla" is col("bla") as
soon as you import the appropriate function:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
udf should be callUDF, e.g.
ds.withColumn("localMonth", callUDF("toLocalMonth", col("unixTs"), col("tz")))
On 17 December 2016 at 09:54, Richard Xin  
wrote:

What I am trying to do: I need to add a column (which could be a complicated
transformation based on the value of a column) to a given dataframe.

Scala script:
val hContext = new HiveContext(sc)
import hContext.implicits._
val df = hContext.sql("select x,y,cluster_no from test.dc")
val len = udf((str: String) => str.length)
val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3}
val df1 = df.withColumn("name-len", len($"x"))
val df2 = df1.withColumn("twice", twice($"cluster_no"))
val df3 = df2.withColumn("triple", triple($"cluster_no"))

The Scala script above seems to work OK, but I am having trouble doing it the Java
way (note that the transformation based on the value of a column could be
complicated, not limited to simple add/minus etc.). Is there a way in Java? Thanks.




   

Re: need help to have a Java version of this scala script

2016-12-17 Thread Igor Berman
Do you mind showing what you have in Java?
In general, $"bla" is col("bla") as soon as you import the appropriate function:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
udf should be callUDF, e.g.
ds.withColumn("localMonth", callUDF("toLocalMonth", col("unixTs"),
col("tz")))

On 17 December 2016 at 09:54, Richard Xin 
wrote:

> what I am trying to do:
> I need to add a column (which could be a complicated transformation based on
> the value of a column) to a given dataframe.
>
> scala script:
> val hContext = new HiveContext(sc)
> import hContext.implicits._
> val df = hContext.sql("select x,y,cluster_no from test.dc")
> val len = udf((str: String) => str.length)
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3}
> val df1 = df.withColumn("name-len", len($"x"))
> val df2 = df1.withColumn("twice", twice($"cluster_no"))
> val df3 = df2.withColumn("triple", triple($"cluster_no"))
>
> The Scala script above seems to work OK, but I am having trouble doing it the
> Java way (note that the transformation based on the value of a column could be
> complicated, not limited to simple add/minus etc.). Is there a way in Java?
> Thanks.
>
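Spelled out along those lines, a Java version of the Scala snippet might look like the sketch below (it assumes the same hContext and Hive table as the original; the UDFs are registered under names that callUDF then refers to):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

DataFrame df = hContext.sql("select x, y, cluster_no from test.dc");

// Register the UDFs by name; any transformation logic can go inside the lambdas.
hContext.udf().register("len", (UDF1<String, Integer>) str -> str.length(), DataTypes.IntegerType);
hContext.udf().register("twice", (UDF1<Integer, Integer>) x -> x * 2, DataTypes.IntegerType);
hContext.udf().register("triple", (UDF1<Integer, Integer>) x -> x * 3, DataTypes.IntegerType);

DataFrame df1 = df.withColumn("name-len", callUDF("len", col("x")));
DataFrame df2 = df1.withColumn("twice", callUDF("twice", col("cluster_no")));
DataFrame df3 = df2.withColumn("triple", callUDF("triple", col("cluster_no")));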