How to do this pairing in Spark?

2016-08-25 Thread Rex X
1. Given following CSV file

> $cat data.csv
>
> ID,City,Zip,Flag
> 1,A,95126,0
> 2,A,95126,1
> 3,A,95126,1
> 4,B,95124,0
> 5,B,95124,1
> 6,C,95124,0
> 7,C,95127,1
> 8,C,95127,0
> 9,C,95127,1


(a) where "ID" above is a primary key (unique),

(b) for each "City" and "Zip" combination, there is one ID in max with
Flag=0; while it can contain multiple IDs with Flag=1 for each "City" and
"Zip" combination.

(c) Flag can be 0 or 1


2. For each ID with Flag=0, we want to pair it with another ID that has Flag=1
and the same City and Zip. If no matching ID with Flag=1 and the same City and
Zip can be found, we simply drop that record.

Here is the expected result:

> ID,City,Zip,Flag
> 1,A,95126,0
> 2,A,95126,1
> 4,B,95124,0
> 5,B,95124,1
> 7,C,95127,1
> 8,C,95127,0


Any valuable tips on how to do this pairing in Python or Scala?

Great thanks!

Rex
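
A minimal sketch of one possible approach with the DataFrame API (assuming Spark 2.x,
the data.csv file and column names from the example above, and that the Flag=1 row
chosen for each pair is simply the one with the lowest ID):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("pairing").getOrCreate()
import spark.implicits._

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")

val flag0 = df.filter($"Flag" === 0)
val flag1 = df.filter($"Flag" === 1)

// Keep one Flag=1 row per (City, Zip) -- here, the one with the lowest ID.
val w = Window.partitionBy("City", "Zip").orderBy("ID")
val flag1One = flag1.withColumn("rn", row_number().over(w)).filter($"rn" === 1).drop("rn")

// Keep only (City, Zip) groups that have both a Flag=0 row and a Flag=1 row.
val keys0 = flag0.select("City", "Zip").distinct()
val keys1 = flag1One.select("City", "Zip").distinct()

val paired = flag0.join(keys1, Seq("City", "Zip"))
  .union(flag1One.join(keys0, Seq("City", "Zip")))
  .select("ID", "City", "Zip", "Flag")
  .orderBy("ID")

paired.show()

This should yield the six rows in the expected result above (IDs 3, 6 and 9 are dropped).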


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

The ZFS Linux port has become very stable these days, given that LLNL maintains
it and also uses it as the file system for their supercomputer (which, from what
I heard, is one of the top machines in the nation).





On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote:
How about using ZFS?





On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:
That's often not as important as you might think. It really only affects the
loading of data by the first Stage. Subsequent Stages (in the same Job or even
in other Jobs if you do it right) will use the map outputs, and will do so with
good data locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha < guha.a...@gmail.com > wrote:
At the core of it map reduce relies heavily on data locality. You would lose the
ability to process data closest to where it resides if you do not use hdfs.
S3 or NFS will not able to provide that.

On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote:
yeah so its seems like its work in progress. At very least Mesos took the
initiative to provide alternatives to ZK. I am just really looking forward for
this.
https://issues.apache.org/jira/browse/MESOS-3797





On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean very good points.
@Mike We dont use Kafka or Hive and I understand that Zookeeper can do many
things but for our use case all we need is for high availability and given the
devops people frustrations here in our company who had extensive experience
managing large clusters in the past we would be very happy to avoid Zookeeper. I
also heard that Mesos can provide High Availability through etcd and consul and
if that is true I will be left with the following stack
Spark + Mesos scheduler + Distributed File System or to be precise I should say
Distributed Storage since S3 is an object store so I guess this will be HDFS for
us + etcd & consul. Now the big question for me is how do I set all this up





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you can not do any window operation in a cluster
without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
In the heart of it and with reference to points you raised about HDFS, one needs
to have a working knowledge of Hadoop Core System including HDFS, Map-reduce
algorithm and Yarn whether one uses them or not. After all Big Data is all about
horizontal scaling with master and nodes (as opposed to vertical scaling like
SQL Server running on a Host). and distributed data (by default data is
replicated three times on different nodes for scalability and availability).
Other members including Sean provided the limits on how far one operate Spark in
its own space. If you are going to deal with data (data in motion and data at
rest), then you will need to interact with some form of storage and HDFS and
compatible file systems like S3 are the natural choices.
Zookeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is also used with Hive for concurrency. It is also a distributed
locking system.
HTH
Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/playing a role/paying a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality so that the non-distributed data that you are sending out from
the driver is actually paying a role more like code (or at least parameters.)
What is sent from the driver to an Executer is then used (typically as seeds or
parameters) to execute some procedure on the Worker node that generates the
actual data on the Workers. After that, you proceed to execute in a more 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

How about using ZFS?





On Thu, Aug 25, 2016 3:48 PM, Mark Hamstra m...@clearstorydata.com wrote:
That's often not as important as you might think. It really only affects the
loading of data by the first Stage. Subsequent Stages (in the same Job or even
in other Jobs if you do it right) will use the map outputs, and will do so with
good data locality.
On Thu, Aug 25, 2016 at 3:36 PM, ayan guha < guha.a...@gmail.com > wrote:
At the core of it map reduce relies heavily on data locality. You would lose the
ability to process data closest to where it resides if you do not use hdfs.
S3 or NFS will not able to provide that.

On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote:
yeah so its seems like its work in progress. At very least Mesos took the
initiative to provide alternatives to ZK. I am just really looking forward for
this.
https://issues.apache.org/jira/browse/MESOS-3797





On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean very good points.
@Mike We dont use Kafka or Hive and I understand that Zookeeper can do many
things but for our use case all we need is for high availability and given the
devops people frustrations here in our company who had extensive experience
managing large clusters in the past we would be very happy to avoid Zookeeper. I
also heard that Mesos can provide High Availability through etcd and consul and
if that is true I will be left with the following stack
Spark + Mesos scheduler + Distributed File System or to be precise I should say
Distributed Storage since S3 is an object store so I guess this will be HDFS for
us + etcd & consul. Now the big question for me is how do I set all this up





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you can not do any window operation in a cluster
without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
In the heart of it and with reference to points you raised about HDFS, one needs
to have a working knowledge of Hadoop Core System including HDFS, Map-reduce
algorithm and Yarn whether one uses them or not. After all Big Data is all about
horizontal scaling with master and nodes (as opposed to vertical scaling like
SQL Server running on a Host). and distributed data (by default data is
replicated three times on different nodes for scalability and availability).
Other members including Sean provided the limits on how far one operate Spark in
its own space. If you are going to deal with data (data in motion and data at
rest), then you will need to interact with some form of storage and HDFS and
compatible file systems like S3 are the natural choices.
Zookeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is also used with Hive for concurrency. It is also a distributed
locking system.
HTH
Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/playing a role/paying a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality so that the non-distributed data that you are sending out from
the driver is actually paying a role more like code (or at least parameters.)
What is sent from the driver to an Executer is then used (typically as seeds or
parameters) to execute some procedure on the Worker node that generates the
actual data on the Workers. After that, you proceed to execute in a more typical
fashion with Spark using the now-instantiated distributed data.
But I don't get the sense that this meta-programming-ish style is really what
the OP was aiming at.
On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen < so...@cloudera.com > wrote:
Without a distributed storage system, 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
That's often not as important as you might think.  It really only affects
the loading of data by the first Stage.  Subsequent Stages (in the same Job
or even in other Jobs if you do it right) will use the map outputs, and
will do so with good data locality.

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha  wrote:

> At the core of it map reduce relies heavily on data locality. You would
> lose the ability to process data closest to where it resides if you do not
> use hdfs.
> S3 or NFS will not able to provide that.
> On 26 Aug 2016 07:49, "kant kodali"  wrote:
>
>> yeah so its seems like its work in progress. At very least Mesos took the
>> initiative to provide alternatives to ZK. I am just really looking forward
>> for this.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>>> Mesos also uses ZK for leader election.  There seems to be some effort
>>> in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806
>>>
>>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>>
>>> @Ofir @Sean very good points.
>>>
>>> @Mike We dont use Kafka or Hive and I understand that Zookeeper can do
>>> many things but for our use case all we need is for high availability and
>>> given the devops people frustrations here in our company who had extensive
>>> experience managing large clusters in the past we would be very happy to
>>> avoid Zookeeper. I also heard that Mesos can provide High Availability
>>> through etcd and consul and if that is true I will be left with the
>>> following stack
>>>
>>> Spark + Mesos scheduler + Distributed File System or to be precise I
>>> should say Distributed Storage since S3 is an object store so I guess this
>>> will be HDFS for us + etcd & consul. Now the big question for me is how do
>>> I set all this up
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>>
>>> Just to add one concrete example regarding HDFS dependency.
>>> Have a look at checkpointing https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>>> For example, for Spark Streaming, you can not do any window operation in
>>> a cluster without checkpointing to HDFS (or S3).
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi Kant,
>>>
>>> I trust the following would be of use.
>>>
>>> Big Data depends on Hadoop Ecosystem from whichever angle one looks at
>>> it.
>>>
>>> In the heart of it and with reference to points you raised about HDFS,
>>> one needs to have a working knowledge of Hadoop Core System including HDFS,
>>> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
>>> Data is all about horizontal scaling with master and nodes (as opposed to
>>> vertical scaling like SQL Server running on a Host). and distributed data
>>> (by default data is replicated three times on different nodes for
>>> scalability and availability).
>>>
>>> Other members including Sean provided the limits on how far one operate
>>> Spark in its own space. If you are going to deal with data (data in motion
>>> and data at rest), then you will need to interact with some form of storage
>>> and HDFS and compatible file systems like S3 are the natural choices.
>>>
>>> Zookeeper is not just about high availability. It is used in Spark
>>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>>> a distributed locking system.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 August 2016 at 20:52, Mark Hamstra 
>>> wrote:
>>>
>>> s/playing a role/paying a role/
>>>
>>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>>> wrote:
>>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality so that the non-distributed data that you are
>>> sending out from the driver is actually paying a role more like code (or at
>>> least parameters.)  What is sent from the driver to an Executer is then
>>> used (typically as seeds or parameters) to execute some procedure on the
>>> 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
> You would lose the ability to process data closest to where it resides if
you do not use hdfs.

This isn't true.  Many other data sources (e.g. Cassandra) support locality.

On Thu, Aug 25, 2016 at 3:36 PM, ayan guha  wrote:

> At the core of it map reduce relies heavily on data locality. You would
> lose the ability to process data closest to where it resides if you do not
> use hdfs.
> S3 or NFS will not able to provide that.
> On 26 Aug 2016 07:49, "kant kodali"  wrote:
>
>> yeah so its seems like its work in progress. At very least Mesos took the
>> initiative to provide alternatives to ZK. I am just really looking forward
>> for this.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>>
>>
>> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
>> wrote:
>>
>>> Mesos also uses ZK for leader election.  There seems to be some effort
>>> in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806
>>>
>>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>>
>>> @Ofir @Sean very good points.
>>>
>>> @Mike We dont use Kafka or Hive and I understand that Zookeeper can do
>>> many things but for our use case all we need is for high availability and
>>> given the devops people frustrations here in our company who had extensive
>>> experience managing large clusters in the past we would be very happy to
>>> avoid Zookeeper. I also heard that Mesos can provide High Availability
>>> through etcd and consul and if that is true I will be left with the
>>> following stack
>>>
>>> Spark + Mesos scheduler + Distributed File System or to be precise I
>>> should say Distributed Storage since S3 is an object store so I guess this
>>> will be HDFS for us + etcd & consul. Now the big question for me is how do
>>> I set all this up
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>>
>>> Just to add one concrete example regarding HDFS dependency.
>>> Have a look at checkpointing https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>>> For example, for Spark Streaming, you can not do any window operation in
>>> a cluster without checkpointing to HDFS (or S3).
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi Kant,
>>>
>>> I trust the following would be of use.
>>>
>>> Big Data depends on Hadoop Ecosystem from whichever angle one looks at
>>> it.
>>>
>>> In the heart of it and with reference to points you raised about HDFS,
>>> one needs to have a working knowledge of Hadoop Core System including HDFS,
>>> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
>>> Data is all about horizontal scaling with master and nodes (as opposed to
>>> vertical scaling like SQL Server running on a Host). and distributed data
>>> (by default data is replicated three times on different nodes for
>>> scalability and availability).
>>>
>>> Other members including Sean provided the limits on how far one operate
>>> Spark in its own space. If you are going to deal with data (data in motion
>>> and data at rest), then you will need to interact with some form of storage
>>> and HDFS and compatible file systems like S3 are the natural choices.
>>>
>>> Zookeeper is not just about high availability. It is used in Spark
>>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>>> a distributed locking system.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 August 2016 at 20:52, Mark Hamstra 
>>> wrote:
>>>
>>> s/playing a role/paying a role/
>>>
>>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>>> wrote:
>>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality so that the non-distributed data that you are
>>> sending out from the driver is actually paying a role more like code (or at
>>> least parameters.)  What is sent from the driver to an Executer is then
>>> used (typically as seeds or parameters) to execute some procedure on the
>>> Worker node that generates the actual data on the Workers.  After that, you
>>> proceed 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread ayan guha
At the core of it, MapReduce relies heavily on data locality. You would
lose the ability to process data closest to where it resides if you do not
use HDFS.
S3 or NFS will not be able to provide that.
On 26 Aug 2016 07:49, "kant kodali"  wrote:

> yeah so its seems like its work in progress. At very least Mesos took the
> initiative to provide alternatives to ZK. I am just really looking forward
> for this.
>
> https://issues.apache.org/jira/browse/MESOS-3797
>
>
>
> On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io
> wrote:
>
>> Mesos also uses ZK for leader election.  There seems to be some effort in
>> supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806
>>
>> On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:
>>
>> @Ofir @Sean very good points.
>>
>> @Mike We dont use Kafka or Hive and I understand that Zookeeper can do
>> many things but for our use case all we need is for high availability and
>> given the devops people frustrations here in our company who had extensive
>> experience managing large clusters in the past we would be very happy to
>> avoid Zookeeper. I also heard that Mesos can provide High Availability
>> through etcd and consul and if that is true I will be left with the
>> following stack
>>
>> Spark + Mesos scheduler + Distributed File System or to be precise I
>> should say Distributed Storage since S3 is an object store so I guess this
>> will be HDFS for us + etcd & consul. Now the big question for me is how do
>> I set all this up
>>
>>
>>
>> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>>
>> Just to add one concrete example regarding HDFS dependency.
>> Have a look at checkpointing https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
>> For example, for Spark Streaming, you can not do any window operation in
>> a cluster without checkpointing to HDFS (or S3).
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi Kant,
>>
>> I trust the following would be of use.
>>
>> Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
>>
>> In the heart of it and with reference to points you raised about HDFS,
>> one needs to have a working knowledge of Hadoop Core System including HDFS,
>> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
>> Data is all about horizontal scaling with master and nodes (as opposed to
>> vertical scaling like SQL Server running on a Host). and distributed data
>> (by default data is replicated three times on different nodes for
>> scalability and availability).
>>
>> Other members including Sean provided the limits on how far one operate
>> Spark in its own space. If you are going to deal with data (data in motion
>> and data at rest), then you will need to interact with some form of storage
>> and HDFS and compatible file systems like S3 are the natural choices.
>>
>> Zookeeper is not just about high availability. It is used in Spark
>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>> a distributed locking system.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 20:52, Mark Hamstra  wrote:
>>
>> s/playing a role/paying a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>> wrote:
>>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data that you are
>> sending out from the driver is actually paying a role more like code (or at
>> least parameters.)  What is sent from the driver to an Executer is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers.  After that, you
>> proceed to execute in a more typical fashion with Spark using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>
>> Without a distributed storage system, your application can only create
>> data on the 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

Yeah, so it seems like it's a work in progress. At the very least, Mesos took
the initiative to provide alternatives to ZK. I am just really looking forward
to this.
https://issues.apache.org/jira/browse/MESOS-3797





On Thu, Aug 25, 2016 2:00 PM, Michael Gummelt mgumm...@mesosphere.io wrote:
Mesos also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress: 
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean very good points.
@Mike We dont use Kafka or Hive and I understand that Zookeeper can do many
things but for our use case all we need is for high availability and given the
devops people frustrations here in our company who had extensive experience
managing large clusters in the past we would be very happy to avoid Zookeeper. I
also heard that Mesos can provide High Availability through etcd and consul and
if that is true I will be left with the following stack
Spark + Mesos scheduler + Distributed File System or to be precise I should say
Distributed Storage since S3 is an object store so I guess this will be HDFS for
us + etcd & consul. Now the big question for me is how do I set all this up





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you can not do any window operation in a cluster
without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
In the heart of it and with reference to points you raised about HDFS, one needs
to have a working knowledge of Hadoop Core System including HDFS, Map-reduce
algorithm and Yarn whether one uses them or not. After all Big Data is all about
horizontal scaling with master and nodes (as opposed to vertical scaling like
SQL Server running on a Host). and distributed data (by default data is
replicated three times on different nodes for scalability and availability).
Other members including Sean provided the limits on how far one operate Spark in
its own space. If you are going to deal with data (data in motion and data at
rest), then you will need to interact with some form of storage and HDFS and
compatible file systems like S3 are the natural choices.
Zookeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is also used with Hive for concurrency. It is also a distributed
locking system.
HTH
Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/playing a role/paying a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality so that the non-distributed data that you are sending out from
the driver is actually paying a role more like code (or at least parameters.)
What is sent from the driver to an Executer is then used (typically as seeds or
parameters) to execute some procedure on the Worker node that generates the
actual data on the Workers. After that, you proceed to execute in a more typical
fashion with Spark using the now-instantiated distributed data.
But I don't get the sense that this meta-programming-ish style is really what
the OP was aiming at.
On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen < so...@cloudera.com > wrote:
Without a distributed storage system, your application can only create data on
the driver and send it out to the workers, and collect data back from the
workers. You can't read or write data in a distributed way. There are use cases
for this, but pretty limited (unless you're running on 1 machine).
I can't really imagine a serious use of (distributed) Spark without (distribute)
storage, in a way I don't think many apps exist that don't read/write data.
The premise here is not just replication, but partitioning data across compute
resources. With a distributed file system, your big input exists across a bunch
of machines and you can send the work to the pieces of data.
On Thu, Aug 25, 2016 at 7:57 PM, kant kodali < 

Re: Using spark to distribute jobs to standalone servers

2016-08-25 Thread Igor Berman
IMHO, you'll need to implement a custom RDD with your locality settings (i.e. a
custom implementation of discovering where each partition is located), plus a
setting for spark.locality.wait.
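
For what it's worth, a rough sketch of that idea follows; the class and partition
type names are hypothetical placeholders, and the real per-server read/query logic
would go inside compute:

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}

// One partition per remote server, pinned to that server's host name.
case class ServerPartition(index: Int, host: String) extends Partition

class ServerBoundRDD(sc: SparkContext, servers: Seq[String])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    servers.zipWithIndex.map { case (host, i) => ServerPartition(i, host) }.toArray[Partition]

  // The scheduler will try to run each partition's task on its preferred host,
  // waiting up to spark.locality.wait before giving up on locality.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[ServerPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val host = split.asInstanceOf[ServerPartition].host
    // Placeholder: fetch or compute this server's slice of the data locally.
    Iterator(s"result computed on $host")
  }
}

Running an action on such an RDD then schedules each partition's task on (or close
to) its owning host, which is exactly what spark.locality.wait trades off against.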

On 24 August 2016 at 03:48, Mohit Jaggi  wrote:

> It is a bit hacky but possible. A lot depends on what kind of queries etc
> you want to run. You could write a data source that reads your data and
> keeps it partitioned the way you want, then use mapPartitions() to execute
> your code…
>
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>
>
>
> On Aug 22, 2016, at 7:59 AM, Larry White  wrote:
>
> Hi,
>
> I have a bit of an unusual use-case and would *greatly* *appreciate* some
> feedback as to whether it is a good fit for spark.
>
> I have a network of compute/data servers configured as a tree as shown
> below
>
>- controller
>- server 1
>   - server 2
>   - server 3
>   - etc.
>
> There are ~20 servers, but the number is increasing to ~100.
>
> Each server contains a different dataset, all in the same format. Each is
> hosted by a different organization, and the data on every individual server
> is unique to that organization.
>
> Data *cannot* be replicated across servers using RDDs or any other means,
> for privacy/ownership reasons.
>
> Data *cannot* be retrieved to the controller, except in aggregate form,
> as the result of a query, for example.
>
> Because of this, there are currently no operations that treats the data as
> if it were a single data set: We could run a classifier on each site
> individually, but cannot for legal reasons, pull all the data into a single
> *physical* dataframe to run the classifier on all of it together.
>
> The servers are located across a wide geographic region (1,000s of miles)
>
> We would like to send jobs from the controller to be executed in parallel
> on all the servers, and retrieve the results to the controller. The jobs
> would consist of SQL-Heavy Java code for 'production' queries, and python
> or R code for ad-hoc queries and predictive modeling.
>
> Spark seems to have the capability to meet many of the individual
> requirements, but is it a reasonable platform overall for building this
> application?
>
> Thank you very much for your assistance.
>
> Larry
>
>
>
>


Re: Please assist: Building Docker image containing spark 2.0

2016-08-25 Thread Marco Mistroni
No, I won't accept that :)
I can't believe I have wasted 3 hrs on a space!

Many thanks, Michael!

kr

On Thu, Aug 25, 2016 at 10:01 PM, Michael Gummelt 
wrote:

> You have a space between "build" and "mvn"
>
> On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni 
> wrote:
>
>> HI all
>>  sorry for the partially off-topic, i hope there's someone on the list
>> who has tried the same and encountered similar issuse
>>
>> Ok so i have created a Docker file to build an ubuntu container which
>> inlcudes spark 2.0, but somehow when it gets to the point where it has to
>> kick off  ./build/mvn command, it errors out with the following
>>
>> ---> Running in 8c2aa6d59842
>> /bin/sh: 1: ./build: Permission denied
>> The command '/bin/sh -c ./build mvn -Pyarn -Phadoop-2.4
>> -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero code:
>> 126
>>
>> I am puzzled as i am root when i build the container, so i should not
>> encounter this issue (btw, if instead of running mvn from the build
>> directory  i use the mvn which i installed on the container, it works fine
>> but it's  painfully slow)
>>
>> here are the details of my Spark command( scala 2.10, java 1.7 , mvn
>> 3.3.9 and git have already been installed)
>>
>> # Spark
>> RUN echo "Installing Apache spark 2.0"
>> RUN git clone git://github.com/apache/spark.git
>> WORKDIR /spark
>> RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
>> clean package
>>
>>
>> Could anyone assist pls?
>>
>> kindest regarsd
>>  Marco
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: Converting DataFrame's int column to Double

2016-08-25 Thread Marco Mistroni
many tx Jestin!

On Thu, Aug 25, 2016 at 10:13 PM, Jestin Ma 
wrote:

> How about this:
>
> df.withColumn("doubles", col("ints").cast("double")).drop("ints")
>
> On Thu, Aug 25, 2016 at 2:09 PM, Marco Mistroni 
> wrote:
>
>> hi all
>>  i might be stuck in old code, but this is what i am doing to convert  a
>> DF int column to Double
>>
>> val intToDoubleFunc:(Int => Double) = lbl => lbl.toDouble
>> val labelToDblFunc = udf(intToDoubleFunc)
>>
>>
>> val convertedDF = df.withColumn("SurvivedDbl",
>> labelToDblFunc(col("Survived")))
>>
>> is there a  better way to do that?
>>
>> kr
>>
>>
>
>


Re: Converting DataFrame's int column to Double

2016-08-25 Thread Jestin Ma
How about this:

df.withColumn("doubles", col("ints").cast("double")).drop("ints")

On Thu, Aug 25, 2016 at 2:09 PM, Marco Mistroni  wrote:

> hi all
>  i might be stuck in old code, but this is what i am doing to convert  a
> DF int column to Double
>
> val intToDoubleFunc:(Int => Double) = lbl => lbl.toDouble
> val labelToDblFunc = udf(intToDoubleFunc)
>
>
> val convertedDF = df.withColumn("SurvivedDbl",
> labelToDblFunc(col("Survived")))
>
> is there a  better way to do that?
>
> kr
>
>


Converting DataFrame's int column to Double

2016-08-25 Thread Marco Mistroni
Hi all,
 I might be stuck in old code, but this is what I am doing to convert a DF
int column to Double:

val intToDoubleFunc:(Int => Double) = lbl => lbl.toDouble
val labelToDblFunc = udf(intToDoubleFunc)


val convertedDF = df.withColumn("SurvivedDbl",
labelToDblFunc(col("Survived")))

is there a  better way to do that?

kr


Re: Please assist: Building Docker image containing spark 2.0

2016-08-25 Thread Michael Gummelt
You have a space between "build" and "mvn"
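
i.e. the Dockerfile should invoke the bundled Maven wrapper as a path, with no space:

RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package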

On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni  wrote:

> HI all
>  sorry for the partially off-topic, i hope there's someone on the list who
> has tried the same and encountered similar issuse
>
> Ok so i have created a Docker file to build an ubuntu container which
> inlcudes spark 2.0, but somehow when it gets to the point where it has to
> kick off  ./build/mvn command, it errors out with the following
>
> ---> Running in 8c2aa6d59842
> /bin/sh: 1: ./build: Permission denied
> The command '/bin/sh -c ./build mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero code:
> 126
>
> I am puzzled as i am root when i build the container, so i should not
> encounter this issue (btw, if instead of running mvn from the build
> directory  i use the mvn which i installed on the container, it works fine
> but it's  painfully slow)
>
> here are the details of my Spark command( scala 2.10, java 1.7 , mvn 3.3.9
> and git have already been installed)
>
> # Spark
> RUN echo "Installing Apache spark 2.0"
> RUN git clone git://github.com/apache/spark.git
> WORKDIR /spark
> RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
> clean package
>
>
> Could anyone assist pls?
>
> kindest regarsd
>  Marco
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Michael Gummelt
Mesos also uses ZK for leader election.  There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806

On Thu, Aug 25, 2016 at 1:55 PM, kant kodali  wrote:

> @Ofir @Sean very good points.
>
> @Mike We dont use Kafka or Hive and I understand that Zookeeper can do
> many things but for our use case all we need is for high availability and
> given the devops people frustrations here in our company who had extensive
> experience managing large clusters in the past we would be very happy to
> avoid Zookeeper. I also heard that Mesos can provide High Availability
> through etcd and consul and if that is true I will be left with the
> following stack
>
> Spark + Mesos scheduler + Distributed File System or to be precise I
> should say Distributed Storage since S3 is an object store so I guess this
> will be HDFS for us + etcd & consul. Now the big question for me is how do
> I set all this up
>
>
>
> On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
>
>> Just to add one concrete example regarding HDFS dependency.
>> Have a look at checkpointing https://spark.apache.org/docs/1.6.2/
>> streaming-programming-guide.html#checkpointing
>> For example, for Spark Streaming, you can not do any window operation in
>> a cluster without checkpointing to HDFS (or S3).
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Hi Kant,
>>
>> I trust the following would be of use.
>>
>> Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
>>
>> In the heart of it and with reference to points you raised about HDFS,
>> one needs to have a working knowledge of Hadoop Core System including HDFS,
>> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
>> Data is all about horizontal scaling with master and nodes (as opposed to
>> vertical scaling like SQL Server running on a Host). and distributed data
>> (by default data is replicated three times on different nodes for
>> scalability and availability).
>>
>> Other members including Sean provided the limits on how far one operate
>> Spark in its own space. If you are going to deal with data (data in motion
>> and data at rest), then you will need to interact with some form of storage
>> and HDFS and compatible file systems like S3 are the natural choices.
>>
>> Zookeeper is not just about high availability. It is used in Spark
>> Streaming with Kafka, it is also used with Hive for concurrency. It is also
>> a distributed locking system.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 20:52, Mark Hamstra  wrote:
>>
>> s/playing a role/paying a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>> wrote:
>>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data that you are
>> sending out from the driver is actually paying a role more like code (or at
>> least parameters.)  What is sent from the driver to an Executer is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers.  After that, you
>> proceed to execute in a more typical fashion with Spark using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way. There
>> are use cases for this, but pretty limited (unless you're running on 1
>> machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distribute) storage, in a way I don't think many apps exist that don't
>> read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

@Ofir @Sean very good points.
@Mike We don't use Kafka or Hive, and I understand that Zookeeper can do many
things, but for our use case all we need is high availability. Given the
frustrations of the DevOps people here in our company, who have extensive
experience managing large clusters, we would be very happy to avoid Zookeeper. I
also heard that Mesos can provide high availability through etcd and Consul, and
if that is true I will be left with the following stack:
Spark + Mesos scheduler + a distributed file system (or, to be precise,
distributed storage, since S3 is an object store, so I guess this will be HDFS
for us) + etcd & Consul. Now the big question for me is how do I set all this up.





On Thu, Aug 25, 2016 1:35 PM, Ofir Manor ofir.ma...@equalum.io wrote:
Just to add one concrete example regarding HDFS dependency. Have a look at checkpointing:
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you can not do any window operation in a
cluster without checkpointing to HDFS (or S3).
Ofir Manor


Co-Founder & CTO | Equalum



Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:
Hi Kant,
I trust the following would be of use.
Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
In the heart of it and with reference to points you raised about HDFS, one needs
to have a working knowledge of Hadoop Core System including HDFS, Map-reduce
algorithm and Yarn whether one uses them or not. After all Big Data is all about
horizontal scaling with master and nodes (as opposed to vertical scaling like
SQL Server running on a Host). and distributed data (by default data is
replicated three times on different nodes for scalability and availability).
Other members including Sean provided the limits on how far one operate Spark in
its own space. If you are going to deal with data (data in motion and data at
rest), then you will need to interact with some form of storage and HDFS and
compatible file systems like S3 are the natural choices.
Zookeeper is not just about high availability. It is used in Spark Streaming
with Kafka, it is also used with Hive for concurrency. It is also a distributed
locking system.
HTH
Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 25 August 2016 at 20:52, Mark Hamstra < m...@clearstorydata.com > wrote:
s/playing a role/paying a role/
On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra < m...@clearstorydata.com > 
wrote:
One way you can start to make this make more sense, Sean, is if you exploit the
code/data duality so that the non-distributed data that you are sending out from
the driver is actually paying a role more like code (or at least parameters.)
What is sent from the driver to an Executer is then used (typically as seeds or
parameters) to execute some procedure on the Worker node that generates the
actual data on the Workers. After that, you proceed to execute in a more typical
fashion with Spark using the now-instantiated distributed data.
But I don't get the sense that this meta-programming-ish style is really what
the OP was aiming at.
On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen < so...@cloudera.com > wrote:
Without a distributed storage system, your application can only create data on
the driver and send it out to the workers, and collect data back from the
workers. You can't read or write data in a distributed way. There are use cases
for this, but pretty limited (unless you're running on 1 machine).
I can't really imagine a serious use of (distributed) Spark without (distribute)
storage, in a way I don't think many apps exist that don't read/write data.
The premise here is not just replication, but partitioning data across compute
resources. With a distributed file system, your big input exists across a bunch
of machines and you can send the work to the pieces of data.
On Thu, Aug 25, 2016 at 7:57 PM, kant kodali < kanth...@gmail.com > wrote:
@Mich I understand why I would need Zookeeper. It is there for fault tolerance
given that spark is a master-slave architecture and when a mater goes down
zookeeper will run a leader election algorithm to elect a new leader however
DevOps hate Zookeeper they would be much happier to go with etcd & consul and
looks like if we mesos scheduler we should be able to drop Zookeeper.
HDFS I am still trying to understand why I would need for spark. I understand
the purpose of distributed file systems in general but I 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Ofir Manor
Just to add one concrete example regarding HDFS dependency.
Have a look at checkpointing
https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing
For example, for Spark Streaming, you can not do any window operation in a
cluster without checkpointing to HDFS (or S3).
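
A minimal sketch of what that dependency looks like in practice (the host names,
ports and paths below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("windowed"), Seconds(10))
// Windowed/stateful operations need a checkpoint directory on shared storage
// such as HDFS or S3.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")
val counts = ssc.socketTextStream("localhost", 9999).countByWindow(Seconds(60), Seconds(10))
counts.print()
ssc.start()
ssc.awaitTermination()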

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh  wrote:

> Hi Kant,
>
> I trust the following would be of use.
>
> Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.
>
> In the heart of it and with reference to points you raised about HDFS, one
> needs to have a working knowledge of Hadoop Core System including HDFS,
> Map-reduce algorithm and Yarn whether one uses them or not. After all Big
> Data is all about horizontal scaling with master and nodes (as opposed to
> vertical scaling like SQL Server running on a Host). and distributed data
> (by default data is replicated three times on different nodes for
> scalability and availability).
>
> Other members including Sean provided the limits on how far one operate
> Spark in its own space. If you are going to deal with data (data in motion
> and data at rest), then you will need to interact with some form of storage
> and HDFS and compatible file systems like S3 are the natural choices.
>
> Zookeeper is not just about high availability. It is used in Spark
> Streaming with Kafka, it is also used with Hive for concurrency. It is also
> a distributed locking system.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 20:52, Mark Hamstra  wrote:
>
>> s/playing a role/paying a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
>> wrote:
>>
>>> One way you can start to make this make more sense, Sean, is if you
>>> exploit the code/data duality so that the non-distributed data that you are
>>> sending out from the driver is actually paying a role more like code (or at
>>> least parameters.)  What is sent from the driver to an Executer is then
>>> used (typically as seeds or parameters) to execute some procedure on the
>>> Worker node that generates the actual data on the Workers.  After that, you
>>> proceed to execute in a more typical fashion with Spark using the
>>> now-instantiated distributed data.
>>>
>>> But I don't get the sense that this meta-programming-ish style is really
>>> what the OP was aiming at.
>>>
>>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>>
 Without a distributed storage system, your application can only create
 data on the driver and send it out to the workers, and collect data back
 from the workers. You can't read or write data in a distributed way. There
 are use cases for this, but pretty limited (unless you're running on 1
 machine).

 I can't really imagine a serious use of (distributed) Spark without
 (distribute) storage, in a way I don't think many apps exist that don't
 read/write data.

 The premise here is not just replication, but partitioning data across
 compute resources. With a distributed file system, your big input exists
 across a bunch of machines and you can send the work to the pieces of data.

 On Thu, Aug 25, 2016 at 7:57 PM, kant kodali 
 wrote:

> @Mich I understand why I would need Zookeeper. It is there for fault
> tolerance given that spark is a master-slave architecture and when a mater
> goes down zookeeper will run a leader election algorithm to elect a new
> leader however DevOps hate Zookeeper they would be much happier to go with
> etcd & consul and looks like if we mesos scheduler we should be able to
> drop Zookeeper.
>
> HDFS I am still trying to understand why I would need for spark. I
> understand the purpose of distributed file systems in general but I don't
> understand in the context of spark since many people say you can run a
> spark distributed cluster in a stand alone mode but I am not sure what are
> its pros/cons if we do it that way. In a hadoop world I understand that 
> one
> of the reasons HDFS is there is for replication other words if we write
> some data to a HDFS it will store that block across different nodes such
> that if one 

Please assist: Building Docker image containing spark 2.0

2016-08-25 Thread Marco Mistroni
Hi all,
 sorry for being partially off-topic; I hope there's someone on the list who
has tried the same and encountered similar issues.

Ok, so I have created a Dockerfile to build an Ubuntu container which
includes Spark 2.0, but somehow when it gets to the point where it has to
kick off the ./build/mvn command, it errors out with the following:

---> Running in 8c2aa6d59842
/bin/sh: 1: ./build: Permission denied
The command '/bin/sh -c ./build mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero code:
126

I am puzzled, as I am root when I build the container, so I should not
encounter this issue (btw, if instead of running mvn from the build
directory I use the mvn which I installed on the container, it works fine,
but it's painfully slow).

Here are the details of my Spark command (Scala 2.10, Java 1.7, mvn 3.3.9
and git have already been installed):

# Spark
RUN echo "Installing Apache spark 2.0"
RUN git clone git://github.com/apache/spark.git
WORKDIR /spark
RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
clean package


Could anyone assist pls?

kindest regards
 Marco


Insert non-null values from dataframe

2016-08-25 Thread Selvam Raman
Hi ,

Dataframe:
colA colB colC colD colE
1 2 3 4 5
1 2 3 null null
1 null null  null 5
null null  3 4 5

I want to insert the dataframe into a NoSQL database (Cassandra), where nulls
would otherwise occupy values, so I have to insert only the columns which have
non-null values in each row.

Expected:

Record 1: (1,2,3,4,5)
Record 2:(1,2,3)
Record 3:(1,5)
Record 4:(3,4,5)

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: quick question

2016-08-25 Thread kant kodali
@Sivakumaran Thanks a lot and this should conclude the thread!





On Thu, Aug 25, 2016 12:39 PM, Sivakumaran S siva.kuma...@me.com wrote:
A Spark job using a streaming context is an endless "while" loop till you kill
it or specify when to stop. Initiate a TCP Server before you start the stream
processing ( 
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_a_WebSocket_server_in_Java
 ).
As long as your driver program is running, it will have a TCP server listening
to connections on the port you specify (you can check with netstat).
And as long as your job is running, a client (A browser in your case running the
dashboard code) will be able to connect to the TCP server running in your Spark
job and receive the data that you write from the TCP Server.
As per the websocket protocol, this connection is an open connection. Read this
too - 
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers
Once you have the data you need at the client, you just need to figure out a way
to push into the javascript object holding the data in your dashboard and
refresh it. If it is an array, you can just take off the oldest data and add the
latest data to it. If it is a hash or a dictionary, you could just update the
value.
I would suggest using JSON for server-client communication. It is easier to
navigate JSON objects in Javascript :) But your requirements may vary.
This may help too ( 
http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/HomeWebsocket/WebsocketHome.html#section7
 )
Regards,
Sivakumaran S
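
To make that concrete, here is a rough sketch of the driver-side pattern described
above, using a plain TCP ServerSocket as a stand-in for a full WebSocket server (the
ports, the input source and the JSON payload are placeholder assumptions; a real
browser dashboard would need the WebSocket handshake on top of this):

import java.io.PrintWriter
import java.net.ServerSocket
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.ArrayBuffer

object DashboardDriver {
  def main(args: Array[String]): Unit = {
    val clients = ArrayBuffer.empty[PrintWriter]

    // Server thread started in the driver before the streaming job begins;
    // it stays up for as long as the driver runs.
    new Thread(new Runnable {
      def run(): Unit = {
        val server = new ServerSocket(9999)
        while (true) {
          val socket = server.accept()
          clients.synchronized { clients += new PrintWriter(socket.getOutputStream, true) }
        }
      }
    }).start()

    val ssc = new StreamingContext(new SparkConf().setAppName("dashboard"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 8888) // placeholder input stream

    // foreachRDD runs on the driver, so results collected here can be pushed
    // straight to every connected dashboard client as JSON.
    lines.count().foreachRDD { rdd =>
      val payload = s"""{"count": ${rdd.first()}}"""
      clients.synchronized { clients.foreach(_.println(payload)) }
    }

    ssc.start()
    ssc.awaitTermination() // the driver (and hence the server) runs until killed
  }
}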



On 25-Aug-2016, at 8:09 PM, kant kodali < kanth...@gmail.com > wrote:
Your assumption is right (that's what I intend to do). My driver code will be in
Java. The link sent by Kevin is an API reference for websockets. I understand how
websockets work in general, but my question was more geared towards seeing the
end-to-end path of how the front-end dashboard gets updated in realtime. When we
collect the data back to the driver program and finish writing data to the
websocket client, the websocket connection terminates, right? So:
1) Is the Spark driver program something that needs to run forever like a typical
server? If not,
2) do we need to open a new websocket connection each time a task terminates?





On Thu, Aug 25, 2016 6:06 AM, Sivakumaran S siva.kuma...@me.com wrote:
I am assuming that you are doing some calculations over a time window. At the
end of the calculations (using RDDs or SQL), once you have collected the data
back to the driver program, you format the data in the way your client
(dashboard) requires it and write it to the websocket.
Is your driver code in Python? The link Kevin has sent should start you off.
Regards,
Sivakumaran
On 25-Aug-2016, at 11:53 AM, kant kodali < kanth...@gmail.com > wrote:
yes for now it will be Spark Streaming Job but later it may change.





On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com wrote:
Is this a Spark Streaming job?
Regards,
Sivakumaran S

@Sivakumaran when you say create a web socket object in your spark code I assume
you meant a spark "task" opening websocket connection from one of the worker 
machines to some node.js server in that case
the websocket connection terminates after the spark task is completed right ? 
and when new data comes in a new task gets created
and opens a new websocket connection again…is that how it should be
On 25-Aug-2016, at 7:08 AM, kant kodali < kanth...@gmail.com > wrote:
@Sivakumaran when you say create a web socket object in your spark code I assume
you meant a spark "task" opening websocket connection from one of the worker
machines to some node.js server in that case the websocket connection terminates
after the spark task is completed right ? and when new data comes in a new task
gets created and opens a new websocket connection again…is that how it should
be?





On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com wrote:
You create a websocket object in your spark code and write your data to the
socket. You create a websocket object in your dashboard code and receive the
data in realtime and update the dashboard. You can use Node.js in your dashboard
( socket.io ). I am sure there are other ways too.
Does that help?
Sivakumaran S
On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote:
so I would need to open a websocket connection from spark worker machine to
where?





On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com wrote:
In the diagram you referenced, a real-time dashboard can be created using
WebSockets. This technology essentially allows your web page to keep an active
line of communication between the client and server, in which case you can
detect and display new information without requiring any user input of page
refreshes. The link below contains additional information on this concept, as
well as links to several different implementations (based on your programming
language preferences).

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
Hi Kant,

I trust the following would be of use.

Big Data depends on Hadoop Ecosystem from whichever angle one looks at it.

At the heart of it, and with reference to the points you raised about HDFS, one
needs a working knowledge of the Hadoop core system, including HDFS, the
map-reduce algorithm and YARN, whether one uses them or not. After all, Big
Data is all about horizontal scaling with a master and nodes (as opposed to
vertical scaling like SQL Server running on a single host) and distributed data
(by default, data is replicated three times on different nodes for
scalability and availability).

Other members, including Sean, outlined the limits of how far one can operate
Spark in its own space. If you are going to deal with data (data in motion
and data at rest), then you will need to interact with some form of storage,
and HDFS and compatible file systems like S3 are the natural choices.

Zookeeper is not just about high availability. It is used in Spark
Streaming with Kafka, it is used with Hive for concurrency, and it is also
a distributed locking system.
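
For example, the receiver-based Kafka integration takes the ZooKeeper quorum directly. A small sketch (the quorum hosts, group id and topic name are illustrative, and the spark-streaming-kafka artifact is assumed to be on the classpath):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("zk-kafka-sketch"), Seconds(10))
// ZooKeeper quorum, consumer group, and topic -> receiver-thread map
val stream = KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181,zk3:2181",
  "example-group", Map("events" -> 1))
stream.map(_._2).count().print()   // message payloads per batch
ssc.start()
ssc.awaitTermination()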

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 20:52, Mark Hamstra  wrote:

> s/paying a role/playing a role/
>
> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
> wrote:
>
>> One way you can start to make this make more sense, Sean, is if you
>> exploit the code/data duality so that the non-distributed data that you are
>> sending out from the driver is actually paying a role more like code (or at
>> least parameters.)  What is sent from the driver to an Executer is then
>> used (typically as seeds or parameters) to execute some procedure on the
>> Worker node that generates the actual data on the Workers.  After that, you
>> proceed to execute in a more typical fashion with Spark using the
>> now-instantiated distributed data.
>>
>> But I don't get the sense that this meta-programming-ish style is really
>> what the OP was aiming at.
>>
>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>>
>>> Without a distributed storage system, your application can only create
>>> data on the driver and send it out to the workers, and collect data back
>>> from the workers. You can't read or write data in a distributed way. There
>>> are use cases for this, but pretty limited (unless you're running on 1
>>> machine).
>>>
>>> I can't really imagine a serious use of (distributed) Spark without
>>> (distribute) storage, in a way I don't think many apps exist that don't
>>> read/write data.
>>>
>>> The premise here is not just replication, but partitioning data across
>>> compute resources. With a distributed file system, your big input exists
>>> across a bunch of machines and you can send the work to the pieces of data.
>>>
>>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>>>
 @Mich I understand why I would need Zookeeper. It is there for fault
 tolerance given that spark is a master-slave architecture and when a mater
 goes down zookeeper will run a leader election algorithm to elect a new
 leader however DevOps hate Zookeeper they would be much happier to go with
 etcd & consul and looks like if we mesos scheduler we should be able to
 drop Zookeeper.

 HDFS I am still trying to understand why I would need for spark. I
 understand the purpose of distributed file systems in general but I don't
 understand in the context of spark since many people say you can run a
 spark distributed cluster in a stand alone mode but I am not sure what are
 its pros/cons if we do it that way. In a hadoop world I understand that one
 of the reasons HDFS is there is for replication other words if we write
 some data to a HDFS it will store that block across different nodes such
 that if one of nodes goes down it can still retrieve that block from other
 nodes. In the context of spark I am not really sure because 1) I am new 2)
 Spark paper says it doesn't replicate data instead it stores the
 lineage(all the transformations) such that it can reconstruct it.






 On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
 wrote:

> You can use Spark on Oracle as a query tool.
>
> It all depends on the mode of the operation.
>
> If you running Spark with yarn-client/cluster then you will need yarn.
> It comes as 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
s/paying a role/playing a role/

On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra 
wrote:

> One way you can start to make this make more sense, Sean, is if you
> exploit the code/data duality so that the non-distributed data that you are
> sending out from the driver is actually paying a role more like code (or at
> least parameters.)  What is sent from the driver to an Executer is then
> used (typically as seeds or parameters) to execute some procedure on the
> Worker node that generates the actual data on the Workers.  After that, you
> proceed to execute in a more typical fashion with Spark using the
> now-instantiated distributed data.
>
> But I don't get the sense that this meta-programming-ish style is really
> what the OP was aiming at.
>
> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:
>
>> Without a distributed storage system, your application can only create
>> data on the driver and send it out to the workers, and collect data back
>> from the workers. You can't read or write data in a distributed way. There
>> are use cases for this, but pretty limited (unless you're running on 1
>> machine).
>>
>> I can't really imagine a serious use of (distributed) Spark without
>> (distribute) storage, in a way I don't think many apps exist that don't
>> read/write data.
>>
>> The premise here is not just replication, but partitioning data across
>> compute resources. With a distributed file system, your big input exists
>> across a bunch of machines and you can send the work to the pieces of data.
>>
>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>>
>>> @Mich I understand why I would need Zookeeper. It is there for fault
>>> tolerance given that spark is a master-slave architecture and when a mater
>>> goes down zookeeper will run a leader election algorithm to elect a new
>>> leader however DevOps hate Zookeeper they would be much happier to go with
>>> etcd & consul and looks like if we mesos scheduler we should be able to
>>> drop Zookeeper.
>>>
>>> HDFS I am still trying to understand why I would need for spark. I
>>> understand the purpose of distributed file systems in general but I don't
>>> understand in the context of spark since many people say you can run a
>>> spark distributed cluster in a stand alone mode but I am not sure what are
>>> its pros/cons if we do it that way. In a hadoop world I understand that one
>>> of the reasons HDFS is there is for replication other words if we write
>>> some data to a HDFS it will store that block across different nodes such
>>> that if one of nodes goes down it can still retrieve that block from other
>>> nodes. In the context of spark I am not really sure because 1) I am new 2)
>>> Spark paper says it doesn't replicate data instead it stores the
>>> lineage(all the transformations) such that it can reconstruct it.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>>> wrote:
>>>
 You can use Spark on Oracle as a query tool.

 It all depends on the mode of the operation.

 If you running Spark with yarn-client/cluster then you will need yarn.
 It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).

 I have not gone and installed Yarn without installing Hadoop.

 What is the overriding reason to have the Spark on its own?

  You can use Spark in Local or Standalone mode if you do not want
 Hadoop core.

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 24 August 2016 at 21:54, kant kodali  wrote:

 What do I loose if I run spark without using HDFS or Zookeper ? which
 of them is almost a must in practice?



>>
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mark Hamstra
One way you can start to make this make more sense, Sean, is if you exploit
the code/data duality so that the non-distributed data that you are sending
out from the driver is actually paying a role more like code (or at least
parameters.)  What is sent from the driver to an Executor is then used
(typically as seeds or parameters) to execute some procedure on the Worker
node that generates the actual data on the Workers.  After that, you
proceed to execute in a more typical fashion with Spark using the
now-instantiated distributed data.

But I don't get the sense that this meta-programming-ish style is really
what the OP was aiming at.
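
A tiny sketch of that pattern (names and sizes are illustrative; sc is an existing SparkContext): the driver ships only seeds, and the actual data is generated on the workers.

// Only the seeds travel from the driver; the "real" data is created worker-side.
val seeds = sc.parallelize(1 to 1000, numSlices = 100)
val samples = seeds.flatMap { seed =>
  val rng = new scala.util.Random(seed.toLong)
  Iterator.fill(100000)(rng.nextGaussian())
}
println(samples.stats())   // from here on, ordinary distributed processing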

On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen  wrote:

> Without a distributed storage system, your application can only create
> data on the driver and send it out to the workers, and collect data back
> from the workers. You can't read or write data in a distributed way. There
> are use cases for this, but pretty limited (unless you're running on 1
> machine).
>
> I can't really imagine a serious use of (distributed) Spark without
> (distribute) storage, in a way I don't think many apps exist that don't
> read/write data.
>
> The premise here is not just replication, but partitioning data across
> compute resources. With a distributed file system, your big input exists
> across a bunch of machines and you can send the work to the pieces of data.
>
> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:
>
>> @Mich I understand why I would need Zookeeper. It is there for fault
>> tolerance given that spark is a master-slave architecture and when a mater
>> goes down zookeeper will run a leader election algorithm to elect a new
>> leader however DevOps hate Zookeeper they would be much happier to go with
>> etcd & consul and looks like if we mesos scheduler we should be able to
>> drop Zookeeper.
>>
>> HDFS I am still trying to understand why I would need for spark. I
>> understand the purpose of distributed file systems in general but I don't
>> understand in the context of spark since many people say you can run a
>> spark distributed cluster in a stand alone mode but I am not sure what are
>> its pros/cons if we do it that way. In a hadoop world I understand that one
>> of the reasons HDFS is there is for replication other words if we write
>> some data to a HDFS it will store that block across different nodes such
>> that if one of nodes goes down it can still retrieve that block from other
>> nodes. In the context of spark I am not really sure because 1) I am new 2)
>> Spark paper says it doesn't replicate data instead it stores the
>> lineage(all the transformations) such that it can reconstruct it.
>>
>>
>>
>>
>>
>>
>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
>> wrote:
>>
>>> You can use Spark on Oracle as a query tool.
>>>
>>> It all depends on the mode of the operation.
>>>
>>> If you running Spark with yarn-client/cluster then you will need yarn.
>>> It comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>>
>>> I have not gone and installed Yarn without installing Hadoop.
>>>
>>> What is the overriding reason to have the Spark on its own?
>>>
>>>  You can use Spark in Local or Standalone mode if you do not want Hadoop
>>> core.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 August 2016 at 21:54, kant kodali  wrote:
>>>
>>> What do I loose if I run spark without using HDFS or Zookeper ? which of
>>> them is almost a must in practice?
>>>
>>>
>>>
>


Re: quick question

2016-08-25 Thread Sivakumaran S
A Spark job using a streaming context is an endless “while" loop till you kill 
it or specify when to stop. Initiate a TCP Server before you start the stream 
processing 
(https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_a_WebSocket_server_in_Java
 
).
 

As long as your driver program is running, it will have a TCP server listening 
to connections on the port you specify (you can check with netstat).

And as long as your job is running, a client (A browser in your case running 
the dashboard code) will be able to connect to the TCP server running in your 
Spark job and receive the data that you write from the TCP Server.

As per the websocket protocol, this connection is an open connection. Read this 
too - 
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers
 


Once you have the data you need at the client, you just need to figure out a 
way to push into the javascript object holding the data in your dashboard and 
refresh it. If it is an array, you can just take off the oldest data and add 
the latest data to it. If it is a hash or a dictionary, you could just update 
the value.

I would suggest using JSON for server-client communication. It is easier to 
navigate JSON objects in Javascript :) But your requirements may vary.

This may help too 
(http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/HomeWebsocket/WebsocketHome.html#section7
 
)

Regards,

Sivakumaran S 




> On 25-Aug-2016, at 8:09 PM, kant kodali  wrote:
> 
> Your assumption is right (thats what I intend to do). My driver code will be 
> in Java. The link sent by Kevin is a API reference to websocket. 
> I understand how websockets works in general but my question was more geared 
> towards seeing the end to end path on how front end dashboard 
> gets updated in realtime. when we collect the data back to the driver program 
> and finished writing data to websocket client the websocket connection
>  terminate right so 
> 
> 1) is Spark driver program something that needs to run for ever like a 
> typical server? if not,
> 2) then do we need to open a web socket connection each time when the task 
> terminates?
> 
> 
> 
> 
> 
> On Thu, Aug 25, 2016 6:06 AM, Sivakumaran S siva.kuma...@me.com 
>  wrote:
> I am assuming that you are doing some calculations over a time window. At the 
> end of the calculations (using RDDs or SQL), once you have collected the data 
> back to the driver program, you format the data in the way your client 
> (dashboard) requires it and write it to the websocket. 
> 
> Is your driver code in Python? The link Kevin has sent should start you off.
> 
> Regards,
> 
> Sivakumaran 
>> On 25-Aug-2016, at 11:53 AM, kant kodali > > wrote:
>> 
>> yes for now it will be Spark Streaming Job but later it may change.
>> 
>> 
>> 
>> 
>> 
>> On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com 
>>  wrote:
>> Is this a Spark Streaming job?
>> 
>> Regards,
>> 
>> Sivakumaran S
>> 
>> 
>>> @Sivakumaran when you say create a web socket object in your spark code I 
>>> assume you meant a spark "task" opening websocket 
>>> connection from one of the worker machines to some node.js server in that 
>>> case the websocket connection terminates after the spark 
>>> task is completed right ? and when new data comes in a new task gets 
>>> created and opens a new websocket connection again…is that how it should be
>> 
>>> On 25-Aug-2016, at 7:08 AM, kant kodali >> > wrote:
>>> 
>>> @Sivakumaran when you say create a web socket object in your spark code I 
>>> assume you meant a spark "task" opening websocket connection from one of 
>>> the worker machines to some node.js server in that case the websocket 
>>> connection terminates after the spark task is completed right ? and when 
>>> new data comes in a new task gets created and opens a new websocket 
>>> connection again…is that how it should be?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com 
>>>  wrote:
>>> You create a websocket object in your spark code and write your data to the 
>>> socket. You create a websocket object in your dashboard code and receive 
>>> the data in realtime and update the dashboard. You can use Node.js in your 
>>> dashboard (socket.io ). I am sure there are other ways 
>>> too.
>>> 
>>> Does that help?
>>> 
>>> Sivakumaran S
>>> 
 On 25-Aug-2016, at 6:30 AM, kant kodali 

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
Without a distributed storage system, your application can only create data
on the driver and send it out to the workers, and collect data back from
the workers. You can't read or write data in a distributed way. There are
use cases for this, but pretty limited (unless you're running on 1 machine).

I can't really imagine a serious use of (distributed) Spark without
(distributed) storage, in the same way that I don't think many apps exist that don't
read/write data.

The premise here is not just replication, but partitioning data across
compute resources. With a distributed file system, your big input exists
across a bunch of machines and you can send the work to the pieces of data.

On Thu, Aug 25, 2016 at 7:57 PM, kant kodali  wrote:

> @Mich I understand why I would need Zookeeper. It is there for fault
> tolerance given that spark is a master-slave architecture and when a mater
> goes down zookeeper will run a leader election algorithm to elect a new
> leader however DevOps hate Zookeeper they would be much happier to go with
> etcd & consul and looks like if we mesos scheduler we should be able to
> drop Zookeeper.
>
> HDFS I am still trying to understand why I would need for spark. I
> understand the purpose of distributed file systems in general but I don't
> understand in the context of spark since many people say you can run a
> spark distributed cluster in a stand alone mode but I am not sure what are
> its pros/cons if we do it that way. In a hadoop world I understand that one
> of the reasons HDFS is there is for replication other words if we write
> some data to a HDFS it will store that block across different nodes such
> that if one of nodes goes down it can still retrieve that block from other
> nodes. In the context of spark I am not really sure because 1) I am new 2)
> Spark paper says it doesn't replicate data instead it stores the
> lineage(all the transformations) such that it can reconstruct it.
>
>
>
>
>
>
> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com
> wrote:
>
>> You can use Spark on Oracle as a query tool.
>>
>> It all depends on the mode of the operation.
>>
>> If you running Spark with yarn-client/cluster then you will need yarn. It
>> comes as part of Hadoop core (HDFS, Map-reduce and Yarn).
>>
>> I have not gone and installed Yarn without installing Hadoop.
>>
>> What is the overriding reason to have the Spark on its own?
>>
>>  You can use Spark in Local or Standalone mode if you do not want Hadoop
>> core.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 August 2016 at 21:54, kant kodali  wrote:
>>
>> What do I loose if I run spark without using HDFS or Zookeper ? which of
>> them is almost a must in practice?
>>
>>
>>


Re: Caching broadcasted DataFrames?

2016-08-25 Thread Takeshi Yamamuro
Hi,

you need to cache df1 to prevent re-computation (including disk reads)
because spark re-broadcasts
data every sql execution.

// maropu
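
A small sketch of that combination (the DataFrame names and the join key are hypothetical; broadcast is the planner hint from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.broadcast

val d1Cached = d1.cache()                                   // avoid re-reading/recomputing d1
val joinedA  = largeA.join(broadcast(d1Cached), Seq("id"))  // hint: broadcast d1 for this join
val joinedB  = largeB.join(broadcast(d1Cached), Seq("id"))  // the second join reuses the cached data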

On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma 
wrote:

> I have a DataFrame d1 that I would like to join with two separate
> DataFrames.
> Since d1 is small enough, I broadcast it.
>
> What I understand about cache vs broadcast is that cache leads to each
> executor storing the partitions its assigned in memory (cluster-wide
> in-memory). Broadcast leads to each node (with multiple executors) storing
> a copy of the dataset (all partitions) inside its own memory.
>
> Since the dataset for d1 is used in two separate joins, should I also
> persist it to prevent reading it from disk again? Or would broadcasting the
> data already take care of that?
>
>
> Thank you,
> Jestin
>



-- 
---
Takeshi Yamamuro


Running yarn with spark not working with Java 8

2016-08-25 Thread Anil Langote
Hi All,

I have a cluster with 1 master and 6 slaves which uses the pre-built version of
Hadoop 2.6.0 and Spark 1.6.2. I was running Hadoop MR and Spark jobs
without any problem with OpenJDK 7 installed on all the nodes. However, when
I upgraded OpenJDK 7 to OpenJDK 8 on all nodes, spark-submit and
spark-shell with YARN caused errors.

I have exported JAVA_HOME in .bashrc and have set OpenJDK 8 as the default
Java using these commands:

sudo update-alternatives --config java
sudo update-alternatives --config javac

I have also tried with Oracle Java 8 and the same error comes up. The
container logs on the slave nodes have the same error as below.

I have tried with spark 1.6.2 pre-built version, spark 2.0 pre-built
version and also tried with spark 2.0 by building it myself.

The Hadoop jobs work perfectly even after upgrading to Java 8. When I switch
back to Java 7, Spark works fine.

My Scala version is 2.11 and the OS is Ubuntu 14.04.4 LTS.

It would be great if someone could give me an idea of how to solve this problem.

Thanks!

PS: I have changed my IP address to xxx.xxx.xxx.xx in the logs.

Stack Trace with Oracle java 8

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/tmp/hadoop-hd_spark/nm-local-dir/usercache/hd_spark/filecache/17/__spark_libs__8247267244939901627.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/08/17 14:05:11 INFO executor.CoarseGrainedExecutorBackend: Started
daemon with process name: 23541@slave01
16/08/17 14:05:11 INFO util.SignalUtils: Registered signal handler for TERM
16/08/17 14:05:11 INFO util.SignalUtils: Registered signal handler for HUP
16/08/17 14:05:11 INFO util.SignalUtils: Registered signal handler for INT
16/08/17 14:05:11 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
16/08/17 14:05:11 INFO spark.SecurityManager: Changing view acls to: hd_spark
16/08/17 14:05:11 INFO spark.SecurityManager: Changing modify acls to: hd_spark
16/08/17 14:05:11 INFO spark.SecurityManager: Changing view acls groups to:
16/08/17 14:05:11 INFO spark.SecurityManager: Changing modify acls groups to:
16/08/17 14:05:11 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users  with view
permissions: Set(hd_spark); groups with view permissions: Set(); users
 with modify permissions: Set(hd_spark); groups with modify
permissions: Set()
16/08/17 14:05:12 INFO client.TransportClientFactory: Successfully
created connection to /xxx.xxx.xxx.xx:37417 after 78 ms (0 ms spent in
bootstraps)
16/08/17 14:05:12 INFO spark.SecurityManager: Changing view acls to: hd_spark
16/08/17 14:05:12 INFO spark.SecurityManager: Changing modify acls to: hd_spark
16/08/17 14:05:12 INFO spark.SecurityManager: Changing view acls groups to:
16/08/17 14:05:12 INFO spark.SecurityManager: Changing modify acls groups to:
16/08/17 14:05:12 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users  with view
permissions: Set(hd_spark); groups with view permissions: Set(); users
 with modify permissions: Set(hd_spark); groups with modify
permissions: Set()
16/08/17 14:05:12 INFO client.TransportClientFactory: Successfully
created connection to /xxx.xxx.xxx.xx:37417 after 1 ms (0 ms spent in
bootstraps)
16/08/17 14:05:12 INFO storage.DiskBlockManager: Created local
directory at 
/tmp/hadoop-hd_spark/nm-local-dir/usercache/hd_spark/appcache/application_1471352972661_0005/blockmgr-d9f23a56-1420-4cd4-abfd-ae9e128c688c
16/08/17 14:05:12 INFO memory.MemoryStore: MemoryStore started with
capacity 366.3 MB
16/08/17 14:05:12 INFO executor.CoarseGrainedExecutorBackend:
Connecting to driver:
spark://coarsegrainedschedu...@xxx.xxx.xxx.xx:37417
16/08/17 14:05:13 ERROR executor.CoarseGrainedExecutorBackend:
RECEIVED SIGNAL TERM
16/08/17 14:05:13 INFO storage.DiskBlockManager: Shutdown hook called
16/08/17 14:05:13 INFO util.ShutdownHookManager: Shutdown hook called



Stack Trace:

16/08/17 14:06:22 ERROR client.TransportClient: Failed to send RPC
4688442384427245199 to /xxx.xxx.xxx.xx:42955:
java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
16/08/17 14:06:22 WARN netty.NettyRpcEndpointRef: Error sending
message [message = RequestExecutors(0,0,Map())] in 1 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 

Re: quick question

2016-08-25 Thread kant kodali
Your assumption is right (that's what I intend to do). My driver code will be in
Java. The link sent by Kevin is an API reference for websockets. I understand how
websockets work in general, but my question was more geared towards seeing the
end-to-end path of how the front-end dashboard gets updated in real time. When we
collect the data back to the driver program and finish writing data to the
websocket client, the websocket connection terminates, right? So:
1) Is the Spark driver program something that needs to run forever like a typical
server? If not, 2) do we need to open a new websocket connection each time the task
terminates?





On Thu, Aug 25, 2016 6:06 AM, Sivakumaran S siva.kuma...@me.com wrote:
I am assuming that you are doing some calculations over a time window. At the
end of the calculations (using RDDs or SQL), once you have collected the data
back to the driver program, you format the data in the way your client
(dashboard) requires it and write it to the websocket.
Is your driver code in Python? The link Kevin has sent should start you off.
Regards,
Sivakumaran
On 25-Aug-2016, at 11:53 AM, kant kodali < kanth...@gmail.com > wrote:
yes for now it will be Spark Streaming Job but later it may change.





On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com wrote:
Is this a Spark Streaming job?
Regards,
Sivakumaran S

@Sivakumaran when you say create a web socket object in your spark code I assume
you meant a spark "task" opening websocket connection from one of the worker 
machines to some node.js server in that case
the websocket connection terminates after the spark task is completed right ? 
and when new data comes in a new task gets created
and opens a new websocket connection again…is that how it should be
On 25-Aug-2016, at 7:08 AM, kant kodali < kanth...@gmail.com > wrote:
@Sivakumaran when you say create a web socket object in your spark code I assume
you meant a spark "task" opening websocket connection from one of the worker
machines to some node.js server in that case the websocket connection terminates
after the spark task is completed right ? and when new data comes in a new task
gets created and opens a new websocket connection again…is that how it should
be?





On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com wrote:
You create a websocket object in your spark code and write your data to the
socket. You create a websocket object in your dashboard code and receive the
data in realtime and update the dashboard. You can use Node.js in your dashboard
( socket.io ). I am sure there are other ways too.
Does that help?
Sivakumaran S
On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote:
so I would need to open a websocket connection from spark worker machine to
where?





On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com wrote:
In the diagram you referenced, a real-time dashboard can be created using
WebSockets. This technology essentially allows your web page to keep an active
line of communication between the client and server, in which case you can
detect and display new information without requiring any user input of page
refreshes. The link below contains additional information on this concept, as
well as links to several different implementations (based on your programming
language preferences).
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API

Hope this helps! - Kevin
On Wed, Aug 24, 2016 at 3:52 PM, kant kodali < kanth...@gmail.com > wrote:

-- Forwarded message --
From: kant kodali < kanth...@gmail.com >
Date: Wed, Aug 24, 2016 at 1:49 PM
Subject: quick question
To: d...@spark.apache.org , us...@spark.apache.org



In this picture what does "Dashboards" really mean? is there a open source
project which can allow me to push the results back to Dashboards such that
Dashboards are always in sync with real time updates? (a push based solution is
better than poll but i am open to whatever is possible given the above picture)

Re: UDF on lpad

2016-08-25 Thread Mike Metzger
Is this what you're after?

def padString(id: Int, chars: String, length: Int): String =
 chars * length + id.toString

padString(123, "0", 10)

res4: String = 0000000000123

Mike
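
For the original question quoted further down in this thread: lpad in org.apache.spark.sql.functions operates on Columns, so it cannot be called inside a udf, which works on plain Scala values. A hedged sketch of both routes (the column name is hypothetical):

import org.apache.spark.sql.functions.{lit, lpad, udf}

// Route 1: no UDF needed -- apply lpad directly to the column.
// df.select(lpad(df("amount"), 10, "0"))

// Route 2: a UDF that does the padding in plain Scala.
val lpadUDF = udf { (s: String, len: Int, pad: String) =>
  if (s.length >= len) s else (pad * (len - s.length)).take(len - s.length) + s
}
// df.select(lpadUDF(df("amount"), lit(10), lit("0")))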

On Thu, Aug 25, 2016 at 12:39 PM, Mich Talebzadeh  wrote:

> Thanks Mike.
>
> Can one turn the first example into a generic UDF similar to the output
> from below where 10 "0" are padded to the left of 123
>
>   def padString(id: Int, chars: String, length: Int): String =
>  (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString
> + id.toString
>
> scala> padString(123, "0", 10)
> res6: String = 0000000000123
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 17:29, Mike Metzger 
> wrote:
>
>> Are you trying to always add x numbers of digits / characters or are you
>> trying to pad to a specific length?  If the latter, try using format
>> strings:
>>
>> // Pad to 10 0 characters
>> val c = 123
>> f"$c%010d"
>>
> // Result is 0000000123
>>
>>
>> // Pad to 10 total characters with 0's
>> val c = 123.87
>> f"$c%010.2f"
>>
> // Result is 0000123.87
>>
>>
>> You can also do inline operations on the values before formatting.  I've
>> used this specifically to pad for hex digits from strings.
>>
>> val d = "100"
>> val hexstring = f"0x${d.toInt}%08X"
>>
> // hexstring is 0x00000064
>>
>>
>> Thanks
>>
>> Mike
>>
>> On Thu, Aug 25, 2016 at 9:27 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Ok I tried this
>>>
>>> def padString(s: String, chars: String, length: Int): String =
>>>  |  (0 until length).map(_ => 
>>> chars(Random.nextInt(chars.length))).mkString
>>> + s
>>>
>>> padString: (s: String, chars: String, length: Int)String
>>> And use it like below:
>>>
>>> Example left pad the figure 12345.87 with 10 "0"s
>>>
>>> padString("12345.87", "0", 10)
>>> res79: String = 000000000012345.87
>>>
>>> Any better way?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 August 2016 at 12:06, Mich Talebzadeh 
>>> wrote:
>>>
 Hi,

 This UDF on substring works

 scala> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
 s.substring(start, end) }
 SubstrUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
 UserDefinedFunction(<function3>,StringType,Some(List(StringType,
 IntegerType, IntegerType)))

 I want something similar to this

 scala> sql("""select lpad("str", 10, "0")""").show
 +----------------+
 |lpad(str, 10, 0)|
 +----------------+
 |      0000000str|
 +----------------+

 scala> val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
 lpad(s, len, chars) }
 <console>:40: error: type mismatch;
  found   : String
  required: org.apache.spark.sql.Column
val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
 lpad(s, len, chars) }


 Any ideas?

 Thanks

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



>>>
>>>
>>
>


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali

@Mich I understand why I would need Zookeeper. It is there for fault tolerance,
given that Spark is a master-slave architecture: when a master goes down,
Zookeeper will run a leader election algorithm to elect a new leader. However,
DevOps hate Zookeeper; they would be much happier to go with etcd & consul, and
it looks like if we use the Mesos scheduler we should be able to drop Zookeeper.
I am still trying to understand why I would need HDFS for Spark. I understand
the purpose of distributed file systems in general, but I don't understand it in
the context of Spark, since many people say you can run a distributed Spark
cluster in standalone mode, but I am not sure what the pros/cons are if we do it
that way. In a Hadoop world I understand that one of the reasons HDFS is there is
for replication; in other words, if we write some data to HDFS it will store that
block across different nodes such that if one of the nodes goes down it can still
retrieve that block from the other nodes. In the context of Spark I am not really
sure, because 1) I am new, and 2) the Spark paper says it doesn't replicate data;
instead it stores the lineage (all the transformations) such that it can
reconstruct it.







On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com wrote:
You can use Spark on Oracle as a query tool.
It all depends on the mode of the operation.
If you running Spark with yarn-client/cluster then you will need yarn. It comes
as part of Hadoop core (HDFS, Map-reduce and Yarn).
I have not gone and installed Yarn without installing Hadoop.
What is the overriding reason to have the Spark on its own?
You can use Spark in Local or Standalone mode if you do not want Hadoop core.
HTH
Dr Mich Talebzadeh



LinkedIn 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw




http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any
other property which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.




On 24 August 2016 at 21:54, kant kodali < kanth...@gmail.com > wrote:
What do I loose if I run spark without using HDFS or Zookeper ? which of them is
almost a must in practice?

Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
The reason your second example works is because of a closure capture behavior. 
It should be ok for a small amount of data.

You could also use SparkR:::broadcast but please keep in mind that is not 
public API we actively support.

Thank you for the information on formula - I will test that out. Please note 
that SparkR code is now at

https://github.com/apache/spark/tree/master/R
_
From: Cinquegrana, Piero 
>
Sent: Thursday, August 25, 2016 6:08 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: >, Felix Cheung 
>


I tested both in local and cluster mode and the ‘<<-‘ seemed to work at least 
for small data. Or am I missing something? Is there a way for me to test? If 
that does not work, can I use something like this?

sc <- SparkR:::getSparkContext()
bcStack <- SparkR:::broadcast(sc,stack)

I realized that the error (Error in writeBin(batch, con, endian = "big"))
was due to an object within the 'parameters' list which was an R formula.

When the spark.lapply method calls the parallelize method, it splits the list
and calls the SparkR:::writeRaw method, which tries to convert the formula to
binary, exploding the size of the object being passed.

https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/R/serialize.R

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, August 25, 2016 2:35 PM
To: Cinquegrana, Piero 
>; 
user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

Hmm <<-- wouldn't work in cluster mode. Are you running spark in local mode?

In any case, I tried running your earlier code and it worked for me on a 250MB 
csv:

scoreModel <- function(parameters){
   library(data.table) # I assume this should data.table
   dat <- data.frame(fread(“file.csv”))
   score(dat,parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?

_
From: Cinquegrana, Piero 
>
Sent: Wednesday, August 24, 2016 10:37 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: Cinquegrana, Piero 
>, Felix 
Cheung >, 
>



Hi Spark experts,

I was able to get around the broadcast issue by using a global assignment ‘<<-‘ 
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
   library(score)
   score(dat,parameters)
}

dat <<- read.csv(‘file.csv’)
modelScores <- spark.lapply(parameterList, scoreModel)

From: Cinquegrana, Piero [mailto:piero.cinquegr...@neustar.biz]
Sent: Tuesday, August 23, 2016 2:39 PM
To: Felix Cheung 
>;user@spark.apache.org
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

The output from score() is very small, just a float. The input, however, could 
be as big as several hundred MBs. I would like to broadcast the dataset to all 
executors.

Thanks,
Piero

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana, Piero 
>;user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

How big is the output from score()?

Also could you elaborate on what you want to broadcast?


On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" 
> wrote:
Hello,

I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a 
complex function to be run across executors and I have to send the entire 
dataset, but there is not (that I could find) a way to broadcast the variable 
in SparkR. I am thus reading the dataset in each executor from disk, but I 
getting the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
   library(read.table)
   dat <- 

RE: How to output RDD to one file with specific name?

2016-08-25 Thread Stefan Panayotov
You can do something like:

 

dbutils.fs.cp("/foo/part-00000", "/foo/my-data.csv")

 

Stefan Panayotov, PhD

  spanayo...@outlook.com

  spanayo...@comcast.net

Cell: 610-517-5586

Home: 610-355-0919

 

From: Gavin Yue [mailto:yue.yuany...@gmail.com] 
Sent: Thursday, August 25, 2016 1:15 PM
To: user 
Subject: How to output RDD to one file with specific name?

 

I am trying to output RDD to disk by

 

rdd.coalesce(1).saveAsTextFile("/foo")

 

It outputs to the foo folder with a file named part-00000.

 

Is there a way I could directly save the file as /foo/somename ?

 

Thanks. 
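
Outside Databricks (where the dbutils.fs.cp call above is available), a similar effect can be had with the Hadoop FileSystem API. A sketch, assuming an HDFS-compatible path, that the single output file is named part-00000, and that rdd and sc are the RDD and SparkContext from the question:

import org.apache.hadoop.fs.{FileSystem, Path}

rdd.coalesce(1).saveAsTextFile("/foo")
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("/foo/part-00000"), new Path("/foo/somename.txt"))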

 



Re: UDF on lpad

2016-08-25 Thread Mich Talebzadeh
Thanks Mike.

Can one turn the first example into a generic UDF similar to the output
from below where 10 "0" are padded to the left of 123

  def padString(id: Int, chars: String, length: Int): String =
 (0 until length).map(_ =>
chars(Random.nextInt(chars.length))).mkString + id.toString

scala> padString(123, "0", 10)
res6: String = 0000000000123

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 17:29, Mike Metzger  wrote:

> Are you trying to always add x numbers of digits / characters or are you
> trying to pad to a specific length?  If the latter, try using format
> strings:
>
> // Pad to 10 0 characters
> val c = 123
> f"$c%010d"
>
> // Result is 0000000123
>
>
> // Pad to 10 total characters with 0's
> val c = 123.87
> f"$c%010.2f"
>
> // Result is 0000123.87
>
>
> You can also do inline operations on the values before formatting.  I've
> used this specifically to pad for hex digits from strings.
>
> val d = "100"
> val hexstring = f"0x${d.toInt}%08X"
>
> // hexstring is 0x00000064
>
>
> Thanks
>
> Mike
>
> On Thu, Aug 25, 2016 at 9:27 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Ok I tried this
>>
>> def padString(s: String, chars: String, length: Int): String =
>>  |  (0 until length).map(_ => 
>> chars(Random.nextInt(chars.length))).mkString
>> + s
>>
>> padString: (s: String, chars: String, length: Int)String
>> And use it like below:
>>
>> Example left pad the figure 12345.87 with 10 "0"s
>>
>> padString("12345.87", "0", 10)
>> res79: String = 000000000012345.87
>>
>> Any better way?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 12:06, Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> This UDF on substring works
>>>
>>> scala> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
>>> s.substring(start, end) }
>>> SubstrUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
>>> UserDefinedFunction(<function3>,StringType,Some(List(StringType,
>>> IntegerType, IntegerType)))
>>>
>>> I want something similar to this
>>>
>>> scala> sql("""select lpad("str", 10, "0")""").show
>>> +----------------+
>>> |lpad(str, 10, 0)|
>>> +----------------+
>>> |      0000000str|
>>> +----------------+
>>>
>>> scala> val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
>>> lpad(s, len, chars) }
>>> <console>:40: error: type mismatch;
>>>  found   : String
>>>  required: org.apache.spark.sql.Column
>>>val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
>>> lpad(s, len, chars) }
>>>
>>>
>>> Any ideas?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>


How to output RDD to one file with specific name?

2016-08-25 Thread Gavin Yue
I am trying to output RDD to disk by

rdd.coalesce(1).saveAsTextFile("/foo")

It outputs to the foo folder with a file named part-00000.

Is there a way I could directly save the file as /foo/somename ?

Thanks.


Caching broadcasted DataFrames?

2016-08-25 Thread Jestin Ma
I have a DataFrame d1 that I would like to join with two separate
DataFrames.
Since d1 is small enough, I broadcast it.

What I understand about cache vs broadcast is that cache leads to each
executor storing the partitions its assigned in memory (cluster-wide
in-memory). Broadcast leads to each node (with multiple executors) storing
a copy of the dataset (all partitions) inside its own memory.

Since the dataset for d1 is used in two separate joins, should I also
persist it to prevent reading it from disk again? Or would broadcasting the
data already take care of that?


Thank you,
Jestin


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework.  There are many ways to give it
data to chomp down on.  If you don't know why you would need HDFS, then you
don't need it.  Same goes for Zookeeper.  Spark works fine without either.

Much of what we read online comes from people with specialized problems and
requirements (such as maintaining a 'highly available' service, or
accessing an existing HDFS).  It can be extremely confusing to the dude who
just needs to do some parallel computing.

Pete

On Wed, Aug 24, 2016 at 3:54 PM, kant kodali  wrote:

> What do I loose if I run spark without using HDFS or Zookeper ? which of
> them is almost a must in practice?
>


Re: UDF on lpad

2016-08-25 Thread Mike Metzger
Are you trying to always add x numbers of digits / characters or are you
trying to pad to a specific length?  If the latter, try using format
strings:

// Pad to 10 0 characters
val c = 123
f"$c%010d"

// Result is 0000000123


// Pad to 10 total characters with 0's
val c = 123.87
f"$c%010.2f"

// Result is 0000123.87


You can also do inline operations on the values before formatting.  I've
used this specifically to pad for hex digits from strings.

val d = "100"
val hexstring = f"0x${d.toInt}%08X"

// hexstring is 0x00000064


Thanks

Mike

On Thu, Aug 25, 2016 at 9:27 AM, Mich Talebzadeh 
wrote:

> Ok I tried this
>
> def padString(s: String, chars: String, length: Int): String =
>  |  (0 until length).map(_ => 
> chars(Random.nextInt(chars.length))).mkString
> + s
>
> padString: (s: String, chars: String, length: Int)String
> And use it like below:
>
> Example left pad the figure 12345.87 with 10 "0"s
>
> padString("12345.87", "0", 10)
> res79: String = 000000000012345.87
>
> Any better way?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 12:06, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> This UDF on substring works
>>
>> scala> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
>> s.substring(start, end) }
>> SubstrUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
>> UserDefinedFunction(<function3>,StringType,Some(List(StringType,
>> IntegerType, IntegerType)))
>>
>> I want something similar to this
>>
>> scala> sql("""select lpad("str", 10, "0")""").show
>> +----------------+
>> |lpad(str, 10, 0)|
>> +----------------+
>> |      0000000str|
>> +----------------+
>>
>> scala> val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
>> lpad(s, len, chars) }
>> <console>:40: error: type mismatch;
>>  found   : String
>>  required: org.apache.spark.sql.Column
>>val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
>> lpad(s, len, chars) }
>>
>>
>> Any ideas?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Incremental Updates and custom SQL via JDBC

2016-08-25 Thread Mich Talebzadeh
As far as I can tell, Spark does not support updates to ORC tables.

This is because Spark would need to send heartbeats to the Hive metastore and
maintain them throughout the DML transaction (deletes and updates here), and
that is not implemented.

By the same token, if you have performed DML on an ORC table in Hive itself,
ending up with delta files, Spark won't be able to read the ORC data until
compaction (rolling the delta files into the base files) is complete!

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 00:54, Sascha Schmied 
wrote:

> Thank you for your answer.
>
> I’m using ORC transactional table right now. But i’m not stuck with that.
> When I send an SQL statement like the following, where old_5sek_agg and
> new_5sek_agg are registered temp tables, I’ll get an exception in spark.
> Same without subselect.
>
> sqlContext.sql("DELETE FROM old_5sek_agg WHERE Sec in (SELECT Sec FROM
> new_5sek_agg)")
>
> When I execute the statement directly in hive ambari view, I don’t get
> exceptions, indeed I get a success info, but the pointed row won’t be
> deleted or updated by UPDATE statement.
>
> I’m not familiar with your op_type and op_time approach and couldn’t find
> any useful resources by quickly asking google, but it sounds promising.
> Unfortunately your answer seems to be cut off in the middle of your
> example. Would you really update the value of those two additional columns
> and if so, how would you do this when it’s not a ORC transactional table.
>
> Thanks again!
>
> Am 25.08.2016 um 01:24 schrieb Mich Talebzadeh  >:
>
> Dr Mich Talebzadeh
>
>
>


Perform an ALS with TF-IDF output (spark 2.0)

2016-08-25 Thread Pasquinell Urbani
Hi there

I am building a product recommendation system for retail. I have been
able to compute the TF-IDF of a user-items data frame in Spark 2.0.

Now I need to transform the TF-IDF output into a data frame with columns
(user_id, item_id, TF_IDF_ratings) in order to perform an ALS. But I have
no clue how to do it.

Can anybody give me some help?

Thank you all.
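
One possible shape for that transformation, sketched under the assumption that the TF-IDF step produced a DataFrame tfidfDF with an integer user_id column and a Vector column named tfidf whose indices play the role of item ids (all names are hypothetical):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._   // `spark` is the active SparkSession, as in spark-shell

val ratings = tfidfDF.rdd.flatMap { row =>
  val user = row.getAs[Int]("user_id")
  val vec  = row.getAs[Vector]("tfidf")
  vec.toArray.zipWithIndex.collect {
    case (weight, item) if weight != 0.0 => (user, item, weight.toFloat)
  }
}.toDF("user_id", "item_id", "tfidf_rating")

val als = new ALS()
  .setImplicitPrefs(true)   // TF-IDF weights behave more like implicit feedback than explicit ratings
  .setUserCol("user_id")
  .setItemCol("item_id")
  .setRatingCol("tfidf_rating")
val model = als.fit(ratings)

Whether implicit or explicit ALS is appropriate is a modelling choice; the sketch only shows how to get the data into the (user, item, rating) shape that ALS expects.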


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Mich Talebzadeh
You can use Spark on Oracle as a query tool.

It all depends on the mode of the operation.

If you running Spark with yarn-client/cluster then you will need yarn. It
comes as part of Hadoop core (HDFS, Map-reduce and Yarn).

I have not gone and installed Yarn without installing Hadoop.

What is the overriding reason to have the Spark on its own?

 You can use Spark in Local or Standalone mode if you do not want Hadoop
core.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 August 2016 at 21:54, kant kodali  wrote:

> What do I loose if I run spark without using HDFS or Zookeper ? which of
> them is almost a must in practice?
>


SparkStreaming + Flume: org.jboss.netty.channel.ChannelException: Failed to bind to: master60/10.0.10.60:31001

2016-08-25 Thread luohui20001
Hi there,

I have a flume cluster sending messages to Spark Streaming. I got an exception like below:

16/08/25 23:00:54 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 -
org.jboss.netty.channel.ChannelException: Failed to bind to: master60/10.0.10.60:31001
at 
org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:106)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:119)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:74)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:68)
at 
org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:162)
at 
org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:169)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at 
org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at 
org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at 
org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:372)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:296)
at 
org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
... 3 more


I checked the port like below:

[hadoop@master60 shellscripts]$ netstat -an | grep 31001
tcp        0      0 :::31001                 :::*                      LISTEN
tcp        0      0 :::10.0.10.60:31001      :::10.0.30.199:34773      ESTABLISHED

It seems that the workers on the other nodes (except the one on master60) could not
connect to port 31001 on master60. However, I tried:

[hadoop@slave61 ~]$ telnet master60 31001
Trying 10.0.10.60...
Connected to master60.
Escape character is '^]'.

So I didn't quite get why the executors cannot build connections. BTW, I am
using Spark 1.6.1 and Flume 1.6.0.

Any idea will be appreciated.
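
For reference, a sketch of the receiver setup this error usually comes from (not taken
from the original post; `ssc` and the storage level are assumed). With the push-based
Flume receiver, the host and port passed to createStream are where the receiver itself
binds, so the receiver has to be scheduled on that machine (here master60) for the bind
to succeed:

// ssc is an existing org.apache.spark.streaming.StreamingContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// The receiver binds master60:31001 on whichever executor it runs on; if it lands on
// another worker, the bind fails with "Cannot assign requested address".
val flumeStream = FlumeUtils.createStream(ssc, "master60", 31001, StorageLevel.MEMORY_AND_DISK_SER_2)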





 

Thanks & Best regards!
San.Luo


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Steve Loughran


With Hadoop 2.7 or later, set

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true

This switches to a no-rename version of the file output committer, which is faster
all round. You are still at risk of things going wrong on failure, though, and
with speculation enabled.
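
For reference, a minimal sketch of how these two settings might be applied (the option
names are taken from the lines above; the SparkSession form assumes Spark 2.0, and the
spark-submit form is equivalent):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-settings")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
  .getOrCreate()

// Or on the command line:
//   spark-submit \
//     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
//     --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true ...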



On 25 Aug 2016, at 13:16, Tal Grynbaum 
> wrote:

Is/was there an option similar to DirectParquetOutputCommitter to write json 
files to S3 ?

On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro 
> wrote:
Hi,

Seems this just prevents writers from leaving partial data in a destination dir 
when jobs fail.
In the previous versions of Spark, there was a way to directly write data in a 
destination though,
Spark v2.0+ has no way to do that because of the critial issue on S3 (See: 
SPARK-10063).

// maropu


On Thu, Aug 25, 2016 at 2:40 PM, Tal Grynbaum 
> wrote:

I read somewhere that its because s3 has to know the size of the file upfront
I dont really understand this,  as to why is it ok  not to know it for the temp 
files and not ok for the final files.
The delete permission is the minor disadvantage from my side,  the worst thing 
is that i have a cluster of 100 machines sitting idle for 15 minutes waiting 
for copy to end.

Any suggestions how to avoid that?

On Thu, Aug 25, 2016, 08:21 Aseem Bansal 
> wrote:
Hi

When Spark saves anything to S3 it creates temporary files. Why? Asking this as 
this requires the access credentials to be given delete permissions along 
with write permissions.



--
---
Takeshi Yamamuro



--
Tal Grynbaum / CTO & co-founder

m# +972-54-7875797


mobile retention done right



Pyspark SQL 1.6.0 write problem

2016-08-25 Thread Ethan Aubin
Hi, I'm having problems writing dataframes with pyspark 1.6.0.  If I create
a small dataframe like:

sqlContext.createDataFrame(pandas.DataFrame.from_dict([{'x':
1}])).write.orc('test-orc')

Only the _SUCCESS file in the output directory is written. The executor log
shows the saved output of the task being written under test-orc/_temporary/.

Writing with parquet rather than orc, I have the same output (a _SUCCESS
file, no parts), but there's also an exception

java.lang.NullPointerException
at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(Par
quetFileWriter.java:456)

matching "Writing empty Dataframes doesn't save any _metadata files"
https://issues.apache.org/jira/browse/SPARK-15393

If I do the equivalent in Scala, things work as expected. Any suggestions
what could be happening? Much appreciated --Ethan


Re: Kafka message metadata with Dstreams

2016-08-25 Thread Cody Koeninger
http://spark.apache.org/docs/latest/api/java/index.html

messageHandler receives a kafka MessageAndMetadata object.

Alternatively, if you just need metadata information on a
per-partition basis, you can use HasOffsetRanges

http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
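
For reference, a rough sketch of both options against the 0.8 direct-stream API (not
from the original reply; the broker, topic, starting offsets and `ssc` are placeholders
or assumed to exist):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// ssc is an existing StreamingContext
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)

// 1) messageHandler: keep per-message metadata (topic, partition, offset) next to the payload.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, Int, Long, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.partition, mmd.offset, mmd.message()))

// 2) HasOffsetRanges: per-partition offset ranges for each batch.
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}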


On Thu, Aug 25, 2016 at 6:45 AM, Pradeep  wrote:
> Hi All,
>
> I am using Dstreams to read Kafka topics. While I can read the messages fine, 
> I also want to get metadata on the message such as offset, time it was put to 
> topic etc.. Is there any Java Api to get this info.
>
> Thanks,
> Pradeep
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: UDF on lpad

2016-08-25 Thread Mich Talebzadeh
Ok I tried this (it needs import scala.util.Random):

def padString(s: String, chars: String, length: Int): String =
  (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString + s

padString: (s: String, chars: String, length: Int)String

And use it like below. Example: left pad the figure 12345.87 with "0"s:

padString("12345.87", "0", 10)
res79: String = 0012345.87

Any better way?
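
One possible alternative (a sketch, not a definitive answer): lpad in
org.apache.spark.sql.functions already operates on Columns, so it can be applied
directly without a UDF; and if a UDF is still wanted, the padding can be done in
plain Scala instead of sampling random characters:

import org.apache.spark.sql.functions.{col, lpad, udf}

// Built-in, no UDF needed (assuming a DataFrame df with a string column "amount"):
//   df.select(lpad(col("amount"), 10, "0"))

// UDF variant: left-pads s to at least `len` characters with `pad`.
val lpadUDF = udf { (s: String, len: Int, pad: String) =>
  val needed = len - s.length
  if (needed <= 0) s else (pad * needed).take(needed) + s
}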

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 12:06, Mich Talebzadeh 
wrote:

> Hi,
>
> This UDF on substring works
>
> scala> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
> s.substring(start, end) }
> SubstrUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
> UserDefinedFunction(,StringType,Some(List(StringType,
> IntegerType, IntegerType)))
>
> I want something similar to this
>
> scala> sql("""select lpad("str", 10, "0")""").show
> ++
> |lpad(str, 10, 0)|
> ++
> |  000str|
> ++
>
> scala> val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
> lpad(s, len, chars) }
> :40: error: type mismatch;
>  found   : String
>  required: org.apache.spark.sql.Column
>val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
> lpad(s, len, chars) }
>
>
> Any ideas?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
Hi,

This package is not dependent on any specific Spark release and can be used
with 1.5. Please refer to the "How To" section here:

https://spark-packages.org/package/dibbhatt/kafka-spark-consumer

You will also find more information on how to use this package in the readme file.

Regards,
Dibyendu


On Thu, Aug 25, 2016 at 7:01 PM,  wrote:

> Hi Dibyendu,
>
> Looks like it is available in 2.0, we are using older version of spark 1.5
> . Could you please let me know how to use this with older versions.
>
> Thanks,
> Asmath
>
> Sent from my iPhone
>
> On Aug 25, 2016, at 6:33 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
> Hi ,
>
> Released latest version of Receiver based Kafka Consumer for Spark
> Streaming.
>
> Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All
> Spark Versions
>
> Available at Spark Packages : https://spark-packages.org/
> package/dibbhatt/kafka-spark-consumer
>
> Also at github  : https://github.com/dibbhatt/kafka-spark-consumer
>
> Salient Features :
>
>- End to End No Data Loss without Write Ahead Log
>- ZK Based offset management for both consumed and processed offset
>- No dependency on WAL and Checkpoint
>- In-built PID Controller for Rate Limiting and Backpressure management
>- Custom Message Interceptor
>
> Please refer to https://github.com/dibbhatt/kafka-spark-consumer/
> blob/master/README.md for more details
>
>
> Regards,
>
> Dibyendu
>
>
>


Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread mdkhajaasmath
Hi Dibyendu,

Looks like it is available in 2.0; we are using an older version, Spark 1.5.
Could you please let me know how to use this with older versions?

Thanks,
Asmath

Sent from my iPhone

> On Aug 25, 2016, at 6:33 AM, Dibyendu Bhattacharya 
>  wrote:
> 
> Hi , 
> 
> Released latest version of Receiver based Kafka Consumer for Spark Streaming.
> 
> Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All 
> Spark Versions
> 
> Available at Spark Packages : 
> https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
> 
> Also at github  : https://github.com/dibbhatt/kafka-spark-consumer
> 
> Salient Features :
> 
> End to End No Data Loss without Write Ahead Log
> ZK Based offset management for both consumed and processed offset
> No dependency on WAL and Checkpoint
> In-built PID Controller for Rate Limiting and Backpressure management
> Custom Message Interceptor
> Please refer to 
> https://github.com/dibbhatt/kafka-spark-consumer/blob/master/README.md for 
> more details
> 
> Regards, 
> Dibyendu
> 


RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Cinquegrana, Piero
I tested both in local and cluster mode and the '<<-' seemed to work at least 
for small data. Or am I missing something? Is there a way for me to test? If 
that does not work, can I use something like this?

sc <- SparkR:::getSparkContext()
bcStack <- SparkR:::broadcast(sc,stack)

I realized that the error "Error in writeBin(batch, con, endian = "big")"
was due to an object within the 'parameters' list which was an R formula.

When the spark.lapply method calls the parallelize method, it splits the list
and calls the SparkR:::writeRaw method, which tries to convert the formula to
binary, exploding the size of the object being passed.

https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/R/serialize.R

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, August 25, 2016 2:35 PM
To: Cinquegrana, Piero ; user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

Hmm <<-- wouldn't work in cluster mode. Are you running spark in local mode?

In any case, I tried running your earlier code and it worked for me on a 250MB 
csv:

scoreModel <- function(parameters){
   library(data.table) # I assume this should be data.table
   dat <- data.frame(fread("file.csv"))
   score(dat,parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?

_
From: Cinquegrana, Piero 
>
Sent: Wednesday, August 24, 2016 10:37 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: Cinquegrana, Piero 
>, Felix 
Cheung >, 
>



Hi Spark experts,

I was able to get around the broadcast issue by using a global assignment '<<-' 
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
   library(score)
   score(dat,parameters)
}

dat <<- read.csv('file.csv')
modelScores <- spark.lapply(parameterList, scoreModel)

From: Cinquegrana, Piero [mailto:piero.cinquegr...@neustar.biz]
Sent: Tuesday, August 23, 2016 2:39 PM
To: Felix Cheung >; 
user@spark.apache.org
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

The output from score() is very small, just a float. The input, however, could 
be as big as several hundred MBs. I would like to broadcast the dataset to all 
executors.

Thanks,
Piero

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana, Piero 
>;user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

How big is the output from score()?

Also could you elaborate on what you want to broadcast?


On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" 
> wrote:
Hello,

I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a 
complex function to be run across executors and I have to send the entire 
dataset, but there is not (that I could find) a way to broadcast the variable 
in SparkR. I am thus reading the dataset in each executor from disk, but I am
getting the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
   library(read.table)
   dat <- data.frame(fread("file.csv"))
   score(dat,parameters)
}

parameterList <- lapply(1:numModels, function(i) getParameters(i))

modelScores <- spark.lapply(parameterList, scoreModel)


Piero Cinquegrana
MarketShare: A Neustar Solution / Data Science
Mobile: +39.329.17.62.539 / www.neustar.biz

Re: quick question

2016-08-25 Thread Sivakumaran S
I am assuming that you are doing some calculations over a time window. At the 
end of the calculations (using RDDs or SQL), once you have collected the data 
back to the driver program, you format the data in the way your client 
(dashboard) requires it and write it to the websocket. 

Is your driver code in Python? The link Kevin has sent should start you off.
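
If the driver is written in Scala, a rough sketch of that flow (assuming the
Java-WebSocket client library, a placeholder endpoint ws://dashboard-host:3000, and an
already-aggregated DStream called results; this is not from the original reply):

import java.net.URI
import org.java_websocket.client.WebSocketClient
import org.java_websocket.handshake.ServerHandshake

// One long-lived connection held by the driver, rather than one per task.
val client = new WebSocketClient(new URI("ws://dashboard-host:3000")) {
  override def onOpen(handshake: ServerHandshake): Unit = ()
  override def onMessage(message: String): Unit = ()
  override def onClose(code: Int, reason: String, remote: Boolean): Unit = ()
  override def onError(ex: Exception): Unit = ex.printStackTrace()
}
client.connectBlocking()

results.foreachRDD { rdd =>
  // The windowed result is assumed to be small; collect it, format it and push it out.
  val payload = rdd.collect().mkString("[", ",", "]")
  client.send(payload)
}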

Regards,

Sivakumaran 
> On 25-Aug-2016, at 11:53 AM, kant kodali  wrote:
> 
> yes for now it will be Spark Streaming Job but later it may change.
> 
> 
> 
> 
> 
> On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com 
>  wrote:
> Is this a Spark Streaming job?
> 
> Regards,
> 
> Sivakumaran S
> 
> 
>> @Sivakumaran when you say create a web socket object in your spark code I 
>> assume you meant a spark "task" opening websocket 
>> connection from one of the worker machines to some node.js server in that 
>> case the websocket connection terminates after the spark 
>> task is completed right ? and when new data comes in a new task gets created 
>> and opens a new websocket connection again…is that how it should be
> 
>> On 25-Aug-2016, at 7:08 AM, kant kodali > > wrote:
>> 
>> @Sivakumaran when you say create a web socket object in your spark code I 
>> assume you meant a spark "task" opening websocket connection from one of the 
>> worker machines to some node.js server in that case the websocket connection 
>> terminates after the spark task is completed right ? and when new data comes 
>> in a new task gets created and opens a new websocket connection again…is 
>> that how it should be?
>> 
>> 
>> 
>> 
>> 
>> On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com 
>>  wrote:
>> You create a websocket object in your spark code and write your data to the 
>> socket. You create a websocket object in your dashboard code and receive the 
>> data in realtime and update the dashboard. You can use Node.js in your 
>> dashboard (socket.io ). I am sure there are other ways 
>> too.
>> 
>> Does that help?
>> 
>> Sivakumaran S
>> 
>>> On 25-Aug-2016, at 6:30 AM, kant kodali >> > wrote:
>>> 
>>> so I would need to open a websocket connection from spark worker machine to 
>>> where?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com 
>>>  wrote:
>>> In the diagram you referenced, a real-time dashboard can be created using 
>>> WebSockets. This technology essentially allows your web page to keep an 
>>> active line of communication between the client and server, in which case 
>>> you can detect and display new information without requiring any user input 
>>> of page refreshes. The link below contains additional information on this 
>>> concept, as well as links to several different implementations (based on 
>>> your programming language preferences).
>>> 
>>> https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API 
>>> 
>>> 
>>> Hope this helps!
>>> - Kevin
>>> 
>>> On Wed, Aug 24, 2016 at 3:52 PM, kant kodali >> > wrote:
>>> 
>>> -- Forwarded message --
>>> From: kant kodali >
>>> Date: Wed, Aug 24, 2016 at 1:49 PM
>>> Subject: quick question
>>> To: d...@spark.apache.org , 
>>> us...@spark.apache.org 
>>> 
>>> 
>>> 
>>> 
>>> In this picture what does "Dashboards" really mean? is there a open source 
>>> project which can allow me to push the results back to Dashboards such that 
>>> Dashboards are always in sync with real time updates? (a push based 
>>> solution is better than poll but i am open to whatever is possible given 
>>> the above picture)
>>> 
>> 
> 
> 



Re: namespace quota not take effect

2016-08-25 Thread Ted Yu
This question should have been posted to user@

Looks like you were using the wrong config.
See:
http://hbase.apache.org/book.html#quota

See 'Setting Namespace Quotas' section further down.

Cheers

On Tue, Aug 23, 2016 at 11:38 PM, W.H  wrote:

> hi guys
>   I am testing the HBase namespace quota for maxTables and maxRegions.
> Following the guide, I added the option "hbase.quota.enabled" with value
> "true" in hbase-site.xml and then created the namespace:
>hbase(main):003:0> describe_namespace 'ns1'
>DESCRIPTION
>   {NAME => 'ns1', maxregions => '2', maxtables => '1'}
>
>   In the namespace definition I limited maxtables to 1, but I created 5
> tables under the namespace "ns1". It seems the quota didn't take effect.
>   The HBase cluster was restarted after hbase-site.xml was modified, and my
> HBase version is 1.1.2.2.4.
>   Any ideas ?Thanks .
>
>
>
>  Best wishes.
>  who.cat


Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
Hmm <<-- wouldn't work in cluster mode. Are you running spark in local mode?

In any case, I tried running your earlier code and it worked for me on a 250MB 
csv:

scoreModel <- function(parameters){
   library(data.table) # I assume this should be data.table
   dat <- data.frame(fread(“file.csv”))
   score(dat,parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?

_
From: Cinquegrana, Piero 
>
Sent: Wednesday, August 24, 2016 10:37 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: Cinquegrana, Piero 
>, Felix 
Cheung >, 
>


Hi Spark experts,

I was able to get around the broadcast issue by using a global assignment ‘<<-‘ 
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
   library(score)
   score(dat,parameters)
}

dat <<- read.csv(‘file.csv’)
modelScores <- spark.lapply(parameterList, scoreModel)

From: Cinquegrana, Piero [mailto:piero.cinquegr...@neustar.biz]
Sent: Tuesday, August 23, 2016 2:39 PM
To: Felix Cheung >; 
user@spark.apache.org
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

The output from score() is very small, just a float. The input, however, could 
be as big as several hundred MBs. I would like to broadcast the dataset to all 
executors.

Thanks,
Piero

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana, Piero 
>;user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

How big is the output from score()?

Also could you elaborate on what you want to broadcast?


On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" 
> wrote:
Hello,

I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a 
complex function to be run across executors and I have to send the entire 
dataset, but there is not (that I could find) a way to broadcast the variable 
in SparkR. I am thus reading the dataset in each executor from disk, but I am
getting the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
   library(read.table)
   dat <- data.frame(fread(“file.csv”))
   score(dat,parameters)
}

parameterList <- lapply(1:numModels, function(i) getParameters(i))

modelScores <- spark.lapply(parameterList, scoreModel)


Piero Cinquegrana
MarketShare: A Neustar Solution / Data Science
Mobile: +39.329.17.62.539 / www.neustar.biz

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
afaik no.

// maropu

On Thu, Aug 25, 2016 at 9:16 PM, Tal Grynbaum 
wrote:

> Is/was there an option similar to DirectParquetOutputCommitter to write
> json files to S3 ?
>
> On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> Seems this just prevents writers from leaving partial data in a
>> destination dir when jobs fail.
>> In the previous versions of Spark, there was a way to directly write data
>> in a destination though,
>> Spark v2.0+ has no way to do that because of the critial issue on S3
>> (See: SPARK-10063).
>>
>> // maropu
>>
>>
>> On Thu, Aug 25, 2016 at 2:40 PM, Tal Grynbaum 
>> wrote:
>>
>>> I read somewhere that its because s3 has to know the size of the file
>>> upfront
>>> I dont really understand this,  as to why is it ok  not to know it for
>>> the temp files and not ok for the final files.
>>> The delete permission is the minor disadvantage from my side,  the worst
>>> thing is that i have a cluster of 100 machines sitting idle for 15 minutes
>>> waiting for copy to end.
>>>
>>> Any suggestions how to avoid that?
>>>
>>> On Thu, Aug 25, 2016, 08:21 Aseem Bansal  wrote:
>>>
 Hi

 When Spark saves anything to S3 it creates temporary files. Why? Asking
 this as this requires the the access credentails to be given
 delete permissions along with write permissions.

>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> *Tal Grynbaum* / *CTO & co-founder*
>
> m# +972-54-7875797
>
> mobile retention done right
>



-- 
---
Takeshi Yamamuro


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Tal Grynbaum
Is/was there an option similar to DirectParquetOutputCommitter to write
json files to S3 ?

On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Seems this just prevents writers from leaving partial data in a
> destination dir when jobs fail.
> In the previous versions of Spark, there was a way to directly write data
> in a destination though,
> Spark v2.0+ has no way to do that because of the critial issue on S3 (See:
> SPARK-10063).
>
> // maropu
>
>
> On Thu, Aug 25, 2016 at 2:40 PM, Tal Grynbaum 
> wrote:
>
>> I read somewhere that its because s3 has to know the size of the file
>> upfront
>> I dont really understand this,  as to why is it ok  not to know it for
>> the temp files and not ok for the final files.
>> The delete permission is the minor disadvantage from my side,  the worst
>> thing is that i have a cluster of 100 machines sitting idle for 15 minutes
>> waiting for copy to end.
>>
>> Any suggestions how to avoid that?
>>
>> On Thu, Aug 25, 2016, 08:21 Aseem Bansal  wrote:
>>
>>> Hi
>>>
>>> When Spark saves anything to S3 it creates temporary files. Why? Asking
>>> this as this requires the access credentials to be given
>>> delete permissions along with write permissions.
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-25 Thread Takeshi Yamamuro
Hi,

Seems this just prevents writers from leaving partial data in a destination
dir when jobs fail.
In previous versions of Spark there was a way to write data directly to a
destination, but Spark v2.0+ has no way to do that because of a critical
issue on S3 (see SPARK-10063).

// maropu


On Thu, Aug 25, 2016 at 2:40 PM, Tal Grynbaum 
wrote:

> I read somewhere that its because s3 has to know the size of the file
> upfront
> I dont really understand this,  as to why is it ok  not to know it for the
> temp files and not ok for the final files.
> The delete permission is the minor disadvantage from my side,  the worst
> thing is that i have a cluster of 100 machines sitting idle for 15 minutes
> waiting for copy to end.
>
> Any suggestions how to avoid that?
>
> On Thu, Aug 25, 2016, 08:21 Aseem Bansal  wrote:
>
>> Hi
>>
>> When Spark saves anything to S3 it creates temporary files. Why? Asking
>> this as this requires the access credentials to be given
>> delete permissions along with write permissions.
>>
>


-- 
---
Takeshi Yamamuro


Kafka message metadata with Dstreams

2016-08-25 Thread Pradeep
Hi All,

I am using DStreams to read Kafka topics. While I can read the messages fine, I
also want to get metadata on the message, such as the offset and the time it was put
on the topic. Is there any Java API to get this info?

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
Hi ,

Released latest version of Receiver based Kafka Consumer for Spark
Streaming.

Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All
Spark Versions

Available at Spark Packages :
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer

Also at github  : https://github.com/dibbhatt/kafka-spark-consumer

Salient Features :

   - End to End No Data Loss without Write Ahead Log
   - ZK Based offset management for both consumed and processed offset
   - No dependency on WAL and Checkpoint
   - In-built PID Controller for Rate Limiting and Backpressure management
   - Custom Message Interceptor

Please refer to
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/README.md for
more details


Regards,

Dibyendu


UDF on lpad

2016-08-25 Thread Mich Talebzadeh
Hi,

This UDF on substring works

scala> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
s.substring(start, end) }
SubstrUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function3>,StringType,Some(List(StringType,
IntegerType, IntegerType)))

I want something similar to this

scala> sql("""select lpad("str", 10, "0")""").show
++
|lpad(str, 10, 0)|
++
|  000str|
++

scala> val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
lpad(s, len, chars) }
<console>:40: error: type mismatch;
 found   : String
 required: org.apache.spark.sql.Column
   val SubstrUDF = udf { (s: String, len: Int, chars: String) =>
lpad(s, len, chars) }


Any ideas?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: quick question

2016-08-25 Thread kant kodali
yes for now it will be Spark Streaming Job but later it may change.





On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com wrote:
Is this a Spark Streaming job?
Regards,
Sivakumaran S

@Sivakumaran when you say create a web socket object in your spark code I assume
you meant a spark "task" opening websocket connection from one of the worker 
machines to some node.js server in that case
the websocket connection terminates after the spark task is completed right ? 
and when new data comes in a new task gets created
and opens a new websocket connection again…is that how it should be
On 25-Aug-2016, at 7:08 AM, kant kodali < kanth...@gmail.com > wrote:
@Sivakumaran when you say create a web socket object in your spark code I assume
you meant a spark "task" opening websocket connection from one of the worker
machines to some node.js server in that case the websocket connection terminates
after the spark task is completed right ? and when new data comes in a new task
gets created and opens a new websocket connection again…is that how it should
be?





On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com wrote:
You create a websocket object in your spark code and write your data to the
socket. You create a websocket object in your dashboard code and receive the
data in realtime and update the dashboard. You can use Node.js in your dashboard
( socket.io ). I am sure there are other ways too.
Does that help?
Sivakumaran S
On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote:
so I would need to open a websocket connection from spark worker machine to
where?





On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com wrote:
In the diagram you referenced, a real-time dashboard can be created using
WebSockets. This technology essentially allows your web page to keep an active
line of communication between the client and server, in which case you can
detect and display new information without requiring any user input of page
refreshes. The link below contains additional information on this concept, as
well as links to several different implementations (based on your programming
language preferences).
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API

Hope this helps! - Kevin
On Wed, Aug 24, 2016 at 3:52 PM, kant kodali < kanth...@gmail.com > wrote:

-- Forwarded message --
From: kant kodali < kanth...@gmail.com >
Date: Wed, Aug 24, 2016 at 1:49 PM
Subject: quick question
To: d...@spark.apache.org , us...@spark.apache.org



In this picture what does "Dashboards" really mean? is there a open source
project which can allow me to push the results back to Dashboards such that
Dashboards are always in sync with real time updates? (a push based solution is
better than poll but i am open to whatever is possible given the above picture)

Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
Hi,

I am using Hadoop 2.6

hduser@rhes564: /home/hduser/dba/bin> hadoop version
Hadoop 2.6.0

Thanks






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 11:48, Bhaskar Dutta  wrote:

> This constant was added in Hadoop 2.3. Maybe you are using an older
> version?
>
> ~bhaskar
>
> On Thu, Aug 25, 2016 at 3:04 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Actually I started using Spark to import data from RDBMS (in this case
>> Oracle) after upgrading to Hive 2, running an import like below
>>
>> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12"
>> --username scratchpad -P \
>> --query "select * from scratchpad.dummy2 where \
>>  \$CONDITIONS" \
>>   --split-by ID \
>>--hive-import  --hive-table "test.dumy2" --target-dir
>> "/tmp/dummy2" *--direct*
>>
>> This gets the data into HDFS and then throws this error
>>
>> ERROR [main] tool.ImportTool: Imported Failed: No enum constant
>> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>>
>> I can easily get the data into Hive from the file on HDFS or dig into the
>> problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark trouble
>> free like below
>>
>>  val df = HiveContext.read.format("jdbc").options(
>>  Map("url" -> dbURL,
>>  "dbtable" -> "scratchpad.dummy)",
>>  "partitionColumn" -> partitionColumnName,
>>  "lowerBound" -> lowerBoundValue,
>>  "upperBound" -> upperBoundValue,
>>  "numPartitions" -> numPartitionsValue,
>>  "user" -> dbUserName,
>>  "password" -> dbPassword)).load
>>
>> It does work, opens parallel connections to Oracle DB and creates DF with
>> the specified number of partitions.
>>
>> One thing I am not sure or tried if Spark supports direct mode yet.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 09:07, Bhaskar Dutta  wrote:
>>
>>> Which RDBMS are you using here, and what is the data volume and
>>> frequency of pulling data off the RDBMS?
>>> Specifying these would help in giving better answers.
>>>
>>> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and
>>> Oracle, so you can use that for better performance if using one of these
>>> databases.
>>>
>>> And don't forget that you Sqoop can load data directly into Parquet or
>>> Avro (I think direct mode is not supported in this case).
>>> Also you can use Kite SDK with Sqoop to manage/transform datasets,
>>> perform schema evolution and such.
>>>
>>> ~bhaskar
>>>
>>>
>>> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
>>> mail.venkatakart...@gmail.com> wrote:
>>>
 Team,
 Please help me in choosing sqoop or spark jdbc to fetch data from
 rdbms. Sqoop has lot of optimizations to fetch data does spark jdbc also
 has those ?

 I'm performing few analytics using spark data for which data is
 residing in rdbms.

 Please guide me with this.


 Thanks
 Venkata Karthik P


>>>
>>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
This constant was added in Hadoop 2.3. Maybe you are using an older version?

~bhaskar

On Thu, Aug 25, 2016 at 3:04 PM, Mich Talebzadeh 
wrote:

> Actually I started using Spark to import data from RDBMS (in this case
> Oracle) after upgrading to Hive 2, running an import like below
>
> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username
> scratchpad -P \
> --query "select * from scratchpad.dummy2 where \
>  \$CONDITIONS" \
>   --split-by ID \
>--hive-import  --hive-table "test.dumy2" --target-dir
> "/tmp/dummy2" *--direct*
>
> This gets the data into HDFS and then throws this error
>
> ERROR [main] tool.ImportTool: Imported Failed: No enum constant
> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>
> I can easily get the data into Hive from the file on HDFS or dig into the
> problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark trouble
> free like below
>
>  val df = HiveContext.read.format("jdbc").options(
>  Map("url" -> dbURL,
>  "dbtable" -> "scratchpad.dummy)",
>  "partitionColumn" -> partitionColumnName,
>  "lowerBound" -> lowerBoundValue,
>  "upperBound" -> upperBoundValue,
>  "numPartitions" -> numPartitionsValue,
>  "user" -> dbUserName,
>  "password" -> dbPassword)).load
>
> It does work, opens parallel connections to Oracle DB and creates DF with
> the specified number of partitions.
>
> One thing I am not sure or tried if Spark supports direct mode yet.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 09:07, Bhaskar Dutta  wrote:
>
>> Which RDBMS are you using here, and what is the data volume and frequency
>> of pulling data off the RDBMS?
>> Specifying these would help in giving better answers.
>>
>> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and
>> Oracle, so you can use that for better performance if using one of these
>> databases.
>>
>> And don't forget that you Sqoop can load data directly into Parquet or
>> Avro (I think direct mode is not supported in this case).
>> Also you can use Kite SDK with Sqoop to manage/transform datasets,
>> perform schema evolution and such.
>>
>> ~bhaskar
>>
>>
>> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
>> mail.venkatakart...@gmail.com> wrote:
>>
>>> Team,
>>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>>> ?
>>>
>>> I'm performing few analytics using spark data for which data is residing
>>> in rdbms.
>>>
>>> Please guide me with this.
>>>
>>>
>>> Thanks
>>> Venkata Karthik P
>>>
>>>
>>
>


Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-25 Thread Marek Wiewiorka
Hi you can take a look at:
https://github.com/hammerlab/spree

it's a bit outdated but maybe it's still possible to use with some more
recent Spark version.

M.

2016-08-25 11:55 GMT+02:00 Mich Talebzadeh :

> Hi,
>
> This may be already there.
>
> A spark job opens up a UI on port specified by --conf
> "spark.ui.port=${SP}"  that defaults to 4040.
>
> However, on UI one needs to refresh the page to see the progress.
>
> Can this be polled so it is refreshed automatically
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Is there anyway Spark UI is set to poll and refreshes itself

2016-08-25 Thread Mich Talebzadeh
Hi,

This may be already there.

A spark job opens up a UI on port specified by --conf
"spark.ui.port=${SP}"  that defaults to 4040.

However, on the UI one needs to refresh the page to see the progress.

Can this be polled so that it is refreshed automatically?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Mich Talebzadeh
Hi Richard,

Windowing/analytics functions for stats are pretty simple. Example:

import org.apache.spark.sql.expressions.Window

val wSpec = Window.partitionBy('transactiontype).orderBy(desc("transactiondate"))
df.filter('transactiondescription.contains(HASHTAG))
  .select('transactiondate, 'transactiondescription, rank().over(wSpec).as("rank"))
  .filter($"rank" === 1).show(1)

val wSpec5 = Window.partitionBy('hashtag).orderBy(substring('transactiondate,1,4))
val newDF = df.where('transactiontype === "DEB" && ('transactiondescription).isNotNull)
  .select(substring('transactiondate,1,4).as("Year"), 'hashtag.as("Retailer"),
    round(sum('debitamount).over(wSpec5),2).as("Spent"))
newDF.distinct.orderBy('year, 'Retailer).collect.foreach(println)
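
For the per-column counts discussed below (empty/null values, distinct values), a
minimal single-pass sketch over a DataFrame df, not from the original reply:

import org.apache.spark.sql.functions.{col, count, countDistinct, sum, when}

val statCols = df.columns.flatMap { c =>
  Seq(
    count(col(c)).as(s"${c}_nonNull"),
    sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_nulls"),
    countDistinct(col(c)).as(s"${c}_distinct")) // approxCountDistinct(col(c)) is cheaper if an estimate is enough
}
df.agg(statCols.head, statCols.tail: _*).show()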

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 08:24, Richard Siebeling  wrote:

> Hi Mich,
>
> thanks for the suggestion, I hadn't thought of that. We'll need to gather
> the statistics in two ways, incremental when new data arrives and over the
> complete set when aggregating or filtering (because I think it's difficult
> to gather statistics while aggregating or filtering).
> The analytic functions could help when gathering the statistics over the
> whole set,
>
> kind regards,
> Richard
>
>
>
> On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Richard,
>>
>> can you use analytics functions for this purpose on DF
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 August 2016 at 21:37, Richard Siebeling 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> I'd like to gather several statistics per column in order to make
>>> analysing data easier. These two statistics are some examples, other
>>> statistics I'd like to gather are the variance, the median, several
>>> percentiles, etc.  We are building a data analysis platform based on Spark,
>>>
>>> kind regards,
>>> Richard
>>>
>>> On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Richard,

 What is the business use case for such statistics?

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 24 August 2016 at 16:01, Bedrytski Aliaksandr 
 wrote:

> Hi Richard,
>
> these intermediate statistics should be calculated from the result of
> the calculation or during the aggregation?
> If they can be derived from the resulting dataframe, why not to cache
> (persist) that result just after the calculation?
> Then you may aggregate statistics from the cached dataframe.
> This way it won't hit performance too much.
>
> Regards
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
>
> Hi,
>
> what is the best way to calculate intermediate column statistics like
> the number of empty values and the number of distinct values each column 
> in
> a dataset when aggregating of filtering data next to the actual result of
> the aggregate or the filtered data?
>
> We are developing an application in which the user can slice-and-dice
> through the data and we would like to, 

Re: quick question

2016-08-25 Thread Sivakumaran S
Is this a Spark Streaming job?

Regards,

Sivakumaran S


> @Sivakumaran when you say create a web socket object in your spark code I 
> assume you meant a spark "task" opening websocket 
> connection from one of the worker machines to some node.js server in that 
> case the websocket connection terminates after the spark 
> task is completed right ? and when new data comes in a new task gets created 
> and opens a new websocket connection again…is that how it should be

> On 25-Aug-2016, at 7:08 AM, kant kodali  wrote:
> 
> @Sivakumaran when you say create a web socket object in your spark code I 
> assume you meant a spark "task" opening websocket connection from one of the 
> worker machines to some node.js server in that case the websocket connection 
> terminates after the spark task is completed right ? and when new data comes 
> in a new task gets created and opens a new websocket connection again…is that 
> how it should be?
> 
> 
> 
> 
> 
> On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com 
>  wrote:
> You create a websocket object in your spark code and write your data to the 
> socket. You create a websocket object in your dashboard code and receive the 
> data in realtime and update the dashboard. You can use Node.js in your 
> dashboard (socket.io ). I am sure there are other ways too.
> 
> Does that help?
> 
> Sivakumaran S
> 
>> On 25-Aug-2016, at 6:30 AM, kant kodali > > wrote:
>> 
>> so I would need to open a websocket connection from spark worker machine to 
>> where?
>> 
>> 
>> 
>> 
>> 
>> On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com 
>>  wrote:
>> In the diagram you referenced, a real-time dashboard can be created using 
>> WebSockets. This technology essentially allows your web page to keep an 
>> active line of communication between the client and server, in which case 
>> you can detect and display new information without requiring any user input 
>> of page refreshes. The link below contains additional information on this 
>> concept, as well as links to several different implementations (based on 
>> your programming language preferences).
>> 
>> https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API 
>> 
>> 
>> Hope this helps!
>> - Kevin
>> 
>> On Wed, Aug 24, 2016 at 3:52 PM, kant kodali > > wrote:
>> 
>> -- Forwarded message --
>> From: kant kodali >
>> Date: Wed, Aug 24, 2016 at 1:49 PM
>> Subject: quick question
>> To: d...@spark.apache.org , 
>> us...@spark.apache.org 
>> 
>> 
>> 
>> 
>> In this picture what does "Dashboards" really mean? is there a open source 
>> project which can allow me to push the results back to Dashboards such that 
>> Dashboards are always in sync with real time updates? (a push based solution 
>> is better than poll but i am open to whatever is possible given the above 
>> picture)
>> 
> 



Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
Actually I started using Spark to import data from RDBMS (in this case
Oracle) after upgrading to Hive 2, running an import like below

sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username
scratchpad -P \
--query "select * from scratchpad.dummy2 where \
 \$CONDITIONS" \
  --split-by ID \
   --hive-import  --hive-table "test.dumy2" --target-dir
"/tmp/dummy2" *--direct*

This gets the data into HDFS and then throws this error

ERROR [main] tool.ImportTool: Imported Failed: No enum constant
org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

I can easily get the data into Hive from the file on HDFS or dig into the
problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark trouble
free like below

 val df = HiveContext.read.format("jdbc").options(
   Map("url" -> dbURL,
       "dbtable" -> "scratchpad.dummy",
       "partitionColumn" -> partitionColumnName,
       "lowerBound" -> lowerBoundValue,
       "upperBound" -> upperBoundValue,
       "numPartitions" -> numPartitionsValue,
       "user" -> dbUserName,
       "password" -> dbPassword)).load

It does work, opens parallel connections to Oracle DB and creates DF with
the specified number of partitions.

One thing I am not sure or tried if Spark supports direct mode yet.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 09:07, Bhaskar Dutta  wrote:

> Which RDBMS are you using here, and what is the data volume and frequency
> of pulling data off the RDBMS?
> Specifying these would help in giving better answers.
>
> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and Oracle,
> so you can use that for better performance if using one of these databases.
>
> And don't forget that you Sqoop can load data directly into Parquet or
> Avro (I think direct mode is not supported in this case).
> Also you can use Kite SDK with Sqoop to manage/transform datasets, perform
> schema evolution and such.
>
> ~bhaskar
>
>
> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
> mail.venkatakart...@gmail.com> wrote:
>
>> Team,
>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>> ?
>>
>> I'm performing few analytics using spark data for which data is residing
>> in rdbms.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
Which RDBMS are you using here, and what is the data volume and frequency
of pulling data off the RDBMS?
Specifying these would help in giving better answers.

Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and Oracle,
so you can use that for better performance if using one of these databases.

And don't forget that Sqoop can load data directly into Parquet or Avro
(I think direct mode is not supported in this case).
Also you can use Kite SDK with Sqoop to manage/transform datasets, perform
schema evolution and such.

~bhaskar

On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
> ?
>
> I'm performing few analytics using spark data for which data is residing
> in rdbms.
>
> Please guide me with this.
>
>
> Thanks
> Venkata Karthik P
>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Sean Owen
Sqoop is probably the more mature tool for the job. It also just does
one thing. The argument for doing it in Spark would be wanting to
integrate it with a larger workflow. I imagine Sqoop would be more
efficient and flexible for just the task of ingest, including
continuously pulling deltas which I am not sure Spark really does for
you.

MapReduce won't matter here. The bottleneck is reading from the RDBMS
in general.

On Wed, Aug 24, 2016 at 11:07 PM, Mich Talebzadeh
 wrote:
> Personally I prefer Spark JDBC.
>
> Both Sqoop and Spark rely on the same drivers.
>
> I think Spark is faster and if you have many nodes you can partition your
> incoming data and take advantage of Spark DAG + in memory offering.
>
> By default Sqoop will use Map-reduce which is pretty slow.
>
> Remember for Spark you will need to have sufficient memory
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 24 August 2016 at 22:39, Venkata Penikalapati
>  wrote:
>>
>> Team,
>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>> ?
>>
>> I'm performing few analytics using spark data for which data is residing
>> in rdbms.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best way to calculate intermediate column statistics

2016-08-25 Thread Richard Siebeling
Hi Mich,

thanks for the suggestion, I hadn't thought of that. We'll need to gather
the statistics in two ways, incremental when new data arrives and over the
complete set when aggregating or filtering (because I think it's difficult
to gather statistics while aggregating or filtering).
The analytic functions could help when gathering the statistics over the
whole set,

kind regards,
Richard



On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh  wrote:

> Hi Richard,
>
> can you use analytics functions for this purpose on DF
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 August 2016 at 21:37, Richard Siebeling 
> wrote:
>
>> Hi Mich,
>>
>> I'd like to gather several statistics per column in order to make
>> analysing data easier. These two statistics are just examples; other
>> statistics I'd like to gather are the variance, the median, several
>> percentiles, etc. We are building a data analysis platform based on Spark,
>>
>> kind regards,
>> Richard
>>
>> On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Richard,
>>>
>>> What is the business use case for such statistics?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 August 2016 at 16:01, Bedrytski Aliaksandr 
>>> wrote:
>>>
 Hi Richard,

 should these intermediate statistics be calculated from the result of
 the calculation or during the aggregation?
 If they can be derived from the resulting dataframe, why not cache
 (persist) that result just after the calculation?
 Then you can aggregate the statistics from the cached dataframe.
 This way it won't hit performance too much.

 Regards
 --
   Bedrytski Aliaksandr
   sp...@bedryt.ski
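
A rough sketch of that cache-then-aggregate approach (Scala; `df`, the filter and the column names are placeholders):

    import org.apache.spark.sql.functions._

    // The user's slice-and-dice step; `df` is the source DataFrame.
    val result = df.filter(col("status") === "active").cache()

    // First use: the actual result that goes back to the application.
    result.show()

    // Second use: column statistics computed from the cached data,
    // so the filter above is not re-evaluated against the source.
    result.agg(
      countDistinct(col("region")).as("region_distinct"),
      count(when(col("region").isNull, 1)).as("region_nulls")
    ).show()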



 On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:

 Hi,

 what is the best way to calculate intermediate column statistics, like
 the number of empty values and the number of distinct values for each column
 in a dataset, when aggregating or filtering data, alongside the actual result
 of the aggregation or the filtered data?

 We are developing an application in which the user can slice-and-dice
 through the data, and we would like, next to the actual resulting data, to
 get column statistics for each column in the resulting dataset. We would
 prefer to calculate the column statistics in the same pass over the data as
 the actual aggregation or filtering; is that possible?

 We could sacrifice a little bit of performance (but not too much),
 that's why we prefer one pass...

 Is this possible in the standard Spark or would this mean modifying the
 source a little bit and recompiling? Is that feasible / wise to do?

 thanks in advance,
 Richard






>>>
>>
>


Re: quick question

2016-08-25 Thread kant kodali
@Sivakumaran when you say create a websocket object in your spark code, I assume
you mean a Spark "task" opening a websocket connection from one of the worker
machines to some Node.js server. In that case the websocket connection terminates
after the Spark task is completed, right? And when new data comes in, a new task
gets created and opens a new websocket connection again… is that how it should
be?
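
For reference, a minimal sketch of that per-task pattern (Scala; it uses Java 11's built-in java.net.http WebSocket client purely for illustration -- any WebSocket or socket.io client library would do -- and the dashboard URL and payload format are invented):

    import java.net.URI
    import java.net.http.{HttpClient, WebSocket}

    import org.apache.spark.streaming.dstream.DStream

    // `updates` is a DStream of already-aggregated records, e.g. JSON strings.
    def pushToDashboard(updates: DStream[String]): Unit = {
      updates.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // Runs inside a task on a worker: one connection per partition,
          // opened here and closed once the partition has been sent.
          val ws = HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(URI.create("ws://dashboard-host:3000/updates"),
                        new WebSocket.Listener {})
            .join()
          records.foreach(r => ws.sendText(r, true).join())
          ws.sendClose(WebSocket.NORMAL_CLOSURE, "batch done").join()
        }
      }
    }

In this sketch the connection lives only as long as the task that opened it; for small aggregated results it is often simpler to collect them to the driver and keep a single long-lived connection there instead.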





On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com wrote:
You create a websocket object in your spark code and write your data to the
socket. You create a websocket object in your dashboard code and receive the
data in realtime and update the dashboard. You can use Node.js in your dashboard
( socket.io ). I am sure there are other ways too.
Does that help?
Sivakumaran S
On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote:
so I would need to open a websocket connection from spark worker machine to
where?





On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com wrote:
In the diagram you referenced, a real-time dashboard can be created using
WebSockets. This technology essentially allows your web page to keep an active
line of communication between the client and server, in which case you can
detect and display new information without requiring any user input or page
refreshes. The link below contains additional information on this concept, as
well as links to several different implementations (based on your programming
language preferences).
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API

Hope this helps! - Kevin
On Wed, Aug 24, 2016 at 3:52 PM, kant kodali < kanth...@gmail.com > wrote:

-- Forwarded message --
From: kant kodali < kanth...@gmail.com >
Date: Wed, Aug 24, 2016 at 1:49 PM
Subject: quick question
To: d...@spark.apache.org , us...@spark.apache.org



In this picture, what does "Dashboards" really mean? Is there an open source
project which can allow me to push the results back to Dashboards such that the
Dashboards are always in sync with real-time updates? (A push-based solution is
better than polling, but I am open to whatever is possible given the above picture.)