Fwd: Adding metadata information to parquet files

2016-04-17 Thread Manivannan Selvadurai
Just a reminder!!

Hi All,

 I'm trying to ingest data from Kafka as Parquet files. I use Spark 1.5.2
and I'm looking for a way to store the source schema in the Parquet file,
the way the Avro schema gets stored as metadata when using the
AvroParquetWriter. Any help is much appreciated.
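
[Editor's sketch] One possible approach, assuming Spark 1.5.x is used to write
the files: Spark stores the full Catalyst schema, including per-field metadata,
in the Parquet footer, so a source-schema string attached as field metadata can
survive a write/read round trip. The DataFrame, column names, schema string and
paths below are illustrative, not from the original thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.MetadataBuilder

object ParquetWithSourceSchema {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-metadata"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // toy DataFrame standing in for the records consumed from Kafka
    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")

    // hypothetical source schema string to carry along with the data
    val sourceSchema = """{"type":"record","name":"Event","fields":[]}"""

    // attach it as field-level metadata; Spark persists the schema (with this
    // metadata) in the Parquet footer and restores it when reading back
    val meta = new MetadataBuilder().putString("source.schema", sourceSchema).build()
    val withMeta = df.withColumn("id", df("id").as("id", meta))
    withMeta.write.parquet("/tmp/events_parquet")

    // the metadata survives the round trip
    val readBack = sqlContext.read.parquet("/tmp/events_parquet")
    println(readBack.schema("id").metadata.getString("source.schema"))
  }
}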


Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
A HashPartitioner will indeed partition based on the key, but you
cannot know on *which* node that key will appear. Again, the RDD
partitions will not necessarily be distributed evenly across your
nodes because of the greedy scheduling of the first wave of tasks,
particularly if those tasks have durations less than the initial
executor delay. I recommend you look at your logs to verify if this is
happening to you.

Mike

On 4/18/16, Anuj Kumar  wrote:
> Good point Mike +1
>
> On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote:
>
>> When submitting a job with spark-submit, I've observed delays (up to
>> 1--2 seconds) for the executors to respond to the driver in order to
>> receive tasks in the first stage. The delay does not persist once the
>> executors have been synchronized.
>>
>> When the tasks are very short, as may be your case (relatively small
>> data and a simple map task like you have described), the 8 tasks in
>> your stage may be allocated to only 1 executor in 2 waves of 4, since
>> the second executor won't have responded to the master before the
>> first 4 tasks on the first executor have completed.
>>
>> To see if this is the cause in your particular case, you could try the
>> following to confirm:
>> 1. Examine the starting times of the tasks alongside their
>> executor
>> 2. Make a "dummy" stage execute before your real stages to
>> synchronize the executors by creating and materializing any random RDD
>> 3. Make the tasks longer, i.e. with some silly computational
>> work.
>>
>> Mike
>>
>>
>> On 4/17/16, Raghava Mutharaju  wrote:
>> > Yes its the same data.
>> >
>> > 1) The number of partitions are the same (8, which is an argument to
>> > the
>> > HashPartitioner). In the first case, these partitions are spread across
>> > both the worker nodes. In the second case, all the partitions are on
>> > the
>> > same node.
>> > 2) What resources would be of interest here? Scala shell takes the
>> default
>> > parameters since we use "bin/spark-shell --master " to run
>> the
>> > scala-shell. For the scala program, we do set some configuration
>> > options
>> > such as driver memory (12GB), parallelism is set to 8 and we use Kryo
>> > serializer.
>> >
>> > We are running this on Azure D3-v2 machines which have 4 cores and 14GB
>> > RAM.1 executor runs on each worker node. Following configuration
>> > options
>> > are set for the scala program -- perhaps we should move it to the spark
>> > config file.
>> >
>> > Driver memory and executor memory are set to 12GB
>> > parallelism is set to 8
>> > Kryo serializer is used
>> > Number of retainedJobs and retainedStages has been increased to check
>> them
>> > in the UI.
>> >
>> > What information regarding Spark Context would be of interest here?
>> >
>> > Regards,
>> > Raghava.
>> >
>> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar 
>> > wrote:
>> >
>> >> If the data file is same then it should have similar distribution of
>> >> keys.
>> >> Few queries-
>> >>
>> >> 1. Did you compare the number of partitions in both the cases?
>> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
>> >> Program being submitted?
>> >>
>> >> Also, can you please share the details of Spark Context, Environment
>> >> and
>> >> Executors when you run via Scala program?
>> >>
>> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>> >> m.vijayaragh...@gmail.com> wrote:
>> >>
>> >>> Hello All,
>> >>>
>> >>> We are using HashPartitioner in the following way on a 3 node cluster
>> (1
>> >>> master and 2 worker nodes).
>> >>>
>> >>> val u =
>> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
>> >>> Int)](line => { line.split("\\|") match { case Array(x, y) =>
>> >>> (y.toInt,
>> >>> x.toInt) } }).partitionBy(new
>> HashPartitioner(8)).setName("u").persist()
>> >>>
>> >>> u.count()
>> >>>
>> >>> If we run this from the spark shell, the data (52 MB) is split across
>> >>> the
>> >>> two worker nodes. But if we put this in a scala program and run it,
>> then
>> >>> all the data goes to only one node. We have run it multiple times,
>> >>> but
>> >>> this
>> >>> behavior does not change. This seems strange.
>> >>>
>> >>> Is there some problem with the way we use HashPartitioner?
>> >>>
>> >>> Thanks in advance.
>> >>>
>> >>> Regards,
>> >>> Raghava.
>> >>>
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> > Raghava
>> > http://raghavam.github.io
>> >
>>
>>
>> --
>> Thanks,
>> Mike
>>
>


-- 
Thanks,
Mike




Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
Good point Mike +1

On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote:

> When submitting a job with spark-submit, I've observed delays (up to
> 1--2 seconds) for the executors to respond to the driver in order to
> receive tasks in the first stage. The delay does not persist once the
> executors have been synchronized.
>
> When the tasks are very short, as may be your case (relatively small
> data and a simple map task like you have described), the 8 tasks in
> your stage may be allocated to only 1 executor in 2 waves of 4, since
> the second executor won't have responded to the master before the
> first 4 tasks on the first executor have completed.
>
> To see if this is the cause in your particular case, you could try the
> following to confirm:
> 1. Examine the starting times of the tasks alongside their executor
> 2. Make a "dummy" stage execute before your real stages to
> synchronize the executors by creating and materializing any random RDD
> 3. Make the tasks longer, i.e. with some silly computational work.
>
> Mike
>
>
> On 4/17/16, Raghava Mutharaju  wrote:
> > Yes its the same data.
> >
> > 1) The number of partitions are the same (8, which is an argument to the
> > HashPartitioner). In the first case, these partitions are spread across
> > both the worker nodes. In the second case, all the partitions are on the
> > same node.
> > 2) What resources would be of interest here? Scala shell takes the
> default
> > parameters since we use "bin/spark-shell --master " to run
> the
> > scala-shell. For the scala program, we do set some configuration options
> > such as driver memory (12GB), parallelism is set to 8 and we use Kryo
> > serializer.
> >
> > We are running this on Azure D3-v2 machines which have 4 cores and 14GB
> > RAM.1 executor runs on each worker node. Following configuration options
> > are set for the scala program -- perhaps we should move it to the spark
> > config file.
> >
> > Driver memory and executor memory are set to 12GB
> > parallelism is set to 8
> > Kryo serializer is used
> > Number of retainedJobs and retainedStages has been increased to check
> them
> > in the UI.
> >
> > What information regarding Spark Context would be of interest here?
> >
> > Regards,
> > Raghava.
> >
> > On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar  wrote:
> >
> >> If the data file is same then it should have similar distribution of
> >> keys.
> >> Few queries-
> >>
> >> 1. Did you compare the number of partitions in both the cases?
> >> 2. Did you compare the resource allocation for Spark Shell vs Scala
> >> Program being submitted?
> >>
> >> Also, can you please share the details of Spark Context, Environment and
> >> Executors when you run via Scala program?
> >>
> >> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
> >> m.vijayaragh...@gmail.com> wrote:
> >>
> >>> Hello All,
> >>>
> >>> We are using HashPartitioner in the following way on a 3 node cluster
> (1
> >>> master and 2 worker nodes).
> >>>
> >>> val u =
> >>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
> >>> Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
> >>> x.toInt) } }).partitionBy(new
> HashPartitioner(8)).setName("u").persist()
> >>>
> >>> u.count()
> >>>
> >>> If we run this from the spark shell, the data (52 MB) is split across
> >>> the
> >>> two worker nodes. But if we put this in a scala program and run it,
> then
> >>> all the data goes to only one node. We have run it multiple times, but
> >>> this
> >>> behavior does not change. This seems strange.
> >>>
> >>> Is there some problem with the way we use HashPartitioner?
> >>>
> >>> Thanks in advance.
> >>>
> >>> Regards,
> >>> Raghava.
> >>>
> >>
> >>
> >
> >
> > --
> > Regards,
> > Raghava
> > http://raghavam.github.io
> >
>
>
> --
> Thanks,
> Mike
>


Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
We are testing with 52MB, but it would go to 20GB and more later on. The
cluster size is also not static; we would be growing it. But the issue here
is the behavior of HashPartitioner -- from what I understand, it should be
partitioning the data based on the hash of the key irrespective of the RAM
size (which is more than adequate now). This behavior is different in the
spark-shell and the Spark Scala program.

We are not using YARN; it's the standalone version of Spark.
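
[Editor's sketch] One knob worth knowing about for standalone mode (an
assumption, not something suggested in this thread): the driver can be told to
wait for executors to register before scheduling the first stage, which avoids
the first wave landing entirely on whichever executor responds first.

import org.apache.spark.{SparkConf, SparkContext}

// wait for all executors to register (up to 30s) before scheduling the first
// stage, so the 8 tasks are not packed onto the first executor that responds
val conf = new SparkConf()
  .setAppName("hash-partitioner-test")
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")
val sc = new SparkContext(conf)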

Regards,
Raghava.


On Mon, Apr 18, 2016 at 12:09 AM, Anuj Kumar  wrote:

> Few params like- spark.task.cpus, spark.cores.max will help. Also, for
> 52MB of data you need not have 12GB allocated to executors. Better to
> assign 512MB or so and increase the number of executors per worker node.
> Try reducing that executor memory to 512MB or so for this case.
>
> On Mon, Apr 18, 2016 at 9:07 AM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Yes its the same data.
>>
>> 1) The number of partitions are the same (8, which is an argument to the
>> HashPartitioner). In the first case, these partitions are spread across
>> both the worker nodes. In the second case, all the partitions are on the
>> same node.
>> 2) What resources would be of interest here? Scala shell takes the
>> default parameters since we use "bin/spark-shell --master " to
>> run the scala-shell. For the scala program, we do set some configuration
>> options such as driver memory (12GB), parallelism is set to 8 and we use
>> Kryo serializer.
>>
>> We are running this on Azure D3-v2 machines which have 4 cores and 14GB
>> RAM.1 executor runs on each worker node. Following configuration options
>> are set for the scala program -- perhaps we should move it to the spark
>> config file.
>>
>> Driver memory and executor memory are set to 12GB
>> parallelism is set to 8
>> Kryo serializer is used
>> Number of retainedJobs and retainedStages has been increased to check
>> them in the UI.
>>
>> What information regarding Spark Context would be of interest here?
>>
>> Regards,
>> Raghava.
>>
>> On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar  wrote:
>>
>>> If the data file is same then it should have similar distribution of
>>> keys. Few queries-
>>>
>>> 1. Did you compare the number of partitions in both the cases?
>>> 2. Did you compare the resource allocation for Spark Shell vs Scala
>>> Program being submitted?
>>>
>>> Also, can you please share the details of Spark Context, Environment and
>>> Executors when you run via Scala program?
>>>
>>> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>>> m.vijayaragh...@gmail.com> wrote:
>>>
 Hello All,

 We are using HashPartitioner in the following way on a 3 node cluster
 (1 master and 2 worker nodes).

 val u =
 sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
 Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
 x.toInt) } }).partitionBy(new HashPartitioner(8)).setName("u").persist()

 u.count()

 If we run this from the spark shell, the data (52 MB) is split across
 the two worker nodes. But if we put this in a scala program and run it,
 then all the data goes to only one node. We have run it multiple times, but
 this behavior does not change. This seems strange.

 Is there some problem with the way we use HashPartitioner?

 Thanks in advance.

 Regards,
 Raghava.

>>>
>>>
>>
>>
>> --
>> Regards,
>> Raghava
>> http://raghavam.github.io
>>
>
>


-- 
Regards,
Raghava
http://raghavam.github.io


Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
When submitting a job with spark-submit, I've observed delays (up to
1--2 seconds) for the executors to respond to the driver in order to
receive tasks in the first stage. The delay does not persist once the
executors have been synchronized.

When the tasks are very short, as may be your case (relatively small
data and a simple map task like you have described), the 8 tasks in
your stage may be allocated to only 1 executor in 2 waves of 4, since
the second executor won't have responded to the master before the
first 4 tasks on the first executor have completed.

To confirm whether this is the cause in your particular case, you could try
the following:
1. Examine the starting times of the tasks alongside their executors
2. Make a "dummy" stage execute before your real stages to
synchronize the executors by creating and materializing any random RDD
(see the sketch below)
3. Make the tasks longer, i.e. with some silly computational work.
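
[Editor's sketch] A minimal sketch of step 2, assuming an existing SparkContext
`sc`; the sizes are arbitrary:

// run a throwaway job first so that every executor has registered and executed
// at least one task before the real stages are scheduled
val warmupPartitions = 64            // arbitrary; comfortably more than total cores
sc.parallelize(1 to 10000, warmupPartitions)
  .map(i => i * i)
  .count()                           // materializes the dummy RDD

// ... then run the real pipeline, e.g. the partitionBy(...).persist() + count()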

Mike


On 4/17/16, Raghava Mutharaju  wrote:
> Yes its the same data.
>
> 1) The number of partitions are the same (8, which is an argument to the
> HashPartitioner). In the first case, these partitions are spread across
> both the worker nodes. In the second case, all the partitions are on the
> same node.
> 2) What resources would be of interest here? Scala shell takes the default
> parameters since we use "bin/spark-shell --master " to run the
> scala-shell. For the scala program, we do set some configuration options
> such as driver memory (12GB), parallelism is set to 8 and we use Kryo
> serializer.
>
> We are running this on Azure D3-v2 machines which have 4 cores and 14GB
> RAM.1 executor runs on each worker node. Following configuration options
> are set for the scala program -- perhaps we should move it to the spark
> config file.
>
> Driver memory and executor memory are set to 12GB
> parallelism is set to 8
> Kryo serializer is used
> Number of retainedJobs and retainedStages has been increased to check them
> in the UI.
>
> What information regarding Spark Context would be of interest here?
>
> Regards,
> Raghava.
>
> On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar  wrote:
>
>> If the data file is same then it should have similar distribution of
>> keys.
>> Few queries-
>>
>> 1. Did you compare the number of partitions in both the cases?
>> 2. Did you compare the resource allocation for Spark Shell vs Scala
>> Program being submitted?
>>
>> Also, can you please share the details of Spark Context, Environment and
>> Executors when you run via Scala program?
>>
>> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>> m.vijayaragh...@gmail.com> wrote:
>>
>>> Hello All,
>>>
>>> We are using HashPartitioner in the following way on a 3 node cluster (1
>>> master and 2 worker nodes).
>>>
>>> val u =
>>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
>>> Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
>>> x.toInt) } }).partitionBy(new HashPartitioner(8)).setName("u").persist()
>>>
>>> u.count()
>>>
>>> If we run this from the spark shell, the data (52 MB) is split across
>>> the
>>> two worker nodes. But if we put this in a scala program and run it, then
>>> all the data goes to only one node. We have run it multiple times, but
>>> this
>>> behavior does not change. This seems strange.
>>>
>>> Is there some problem with the way we use HashPartitioner?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Raghava.
>>>
>>
>>
>
>
> --
> Regards,
> Raghava
> http://raghavam.github.io
>


-- 
Thanks,
Mike




Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
A few params like spark.task.cpus and spark.cores.max will help. Also, for 52MB
of data you need not have 12GB allocated to the executors. Better to reduce the
executor memory to 512MB or so for this case and increase the number of
executors per worker node.
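
[Editor's sketch] A sketch of that sizing as SparkConf settings; the values are
the ones suggested above and should be treated as a starting point, not a
prescription.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hash-partitioner-test")
  .set("spark.executor.memory", "512m")  // small heap is enough for 52MB of input
  .set("spark.cores.max", "8")           // cap total cores used across the cluster
  .set("spark.task.cpus", "1")           // default: one core per task
val sc = new SparkContext(conf)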

On Mon, Apr 18, 2016 at 9:07 AM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Yes its the same data.
>
> 1) The number of partitions are the same (8, which is an argument to the
> HashPartitioner). In the first case, these partitions are spread across
> both the worker nodes. In the second case, all the partitions are on the
> same node.
> 2) What resources would be of interest here? Scala shell takes the default
> parameters since we use "bin/spark-shell --master " to run the
> scala-shell. For the scala program, we do set some configuration options
> such as driver memory (12GB), parallelism is set to 8 and we use Kryo
> serializer.
>
> We are running this on Azure D3-v2 machines which have 4 cores and 14GB
> RAM.1 executor runs on each worker node. Following configuration options
> are set for the scala program -- perhaps we should move it to the spark
> config file.
>
> Driver memory and executor memory are set to 12GB
> parallelism is set to 8
> Kryo serializer is used
> Number of retainedJobs and retainedStages has been increased to check them
> in the UI.
>
> What information regarding Spark Context would be of interest here?
>
> Regards,
> Raghava.
>
> On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar  wrote:
>
>> If the data file is same then it should have similar distribution of
>> keys. Few queries-
>>
>> 1. Did you compare the number of partitions in both the cases?
>> 2. Did you compare the resource allocation for Spark Shell vs Scala
>> Program being submitted?
>>
>> Also, can you please share the details of Spark Context, Environment and
>> Executors when you run via Scala program?
>>
>> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
>> m.vijayaragh...@gmail.com> wrote:
>>
>>> Hello All,
>>>
>>> We are using HashPartitioner in the following way on a 3 node cluster (1
>>> master and 2 worker nodes).
>>>
>>> val u =
>>> sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
>>> Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
>>> x.toInt) } }).partitionBy(new HashPartitioner(8)).setName("u").persist()
>>>
>>> u.count()
>>>
>>> If we run this from the spark shell, the data (52 MB) is split across
>>> the two worker nodes. But if we put this in a scala program and run it,
>>> then all the data goes to only one node. We have run it multiple times, but
>>> this behavior does not change. This seems strange.
>>>
>>> Is there some problem with the way we use HashPartitioner?
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Raghava.
>>>
>>
>>
>
>
> --
> Regards,
> Raghava
> http://raghavam.github.io
>


Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Yes, it's the same data.

1) The number of partitions is the same (8, which is the argument to the
HashPartitioner). In the first case, these partitions are spread across
both worker nodes. In the second case, all the partitions are on the
same node.
2) What resources would be of interest here? The Scala shell takes the default
parameters since we use "bin/spark-shell --master " to run it. For the
Scala program, we do set some configuration options: driver memory is 12GB,
parallelism is set to 8, and we use the Kryo serializer.

We are running this on Azure D3-v2 machines, which have 4 cores and 14GB
RAM. One executor runs on each worker node. The following configuration options
are set for the Scala program -- perhaps we should move them to the Spark
config file:

Driver memory and executor memory are set to 12GB
parallelism is set to 8
Kryo serializer is used
Number of retainedJobs and retainedStages has been increased to check them
in the UI.

What information regarding the Spark Context would be of interest here?

Regards,
Raghava.

On Sun, Apr 17, 2016 at 10:54 PM, Anuj Kumar  wrote:

> If the data file is same then it should have similar distribution of keys.
> Few queries-
>
> 1. Did you compare the number of partitions in both the cases?
> 2. Did you compare the resource allocation for Spark Shell vs Scala
> Program being submitted?
>
> Also, can you please share the details of Spark Context, Environment and
> Executors when you run via Scala program?
>
> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Hello All,
>>
>> We are using HashPartitioner in the following way on a 3 node cluster (1
>> master and 2 worker nodes).
>>
>> val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
>> Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
>> x.toInt) } }).partitionBy(new HashPartitioner(8)).setName("u").persist()
>>
>> u.count()
>>
>> If we run this from the spark shell, the data (52 MB) is split across the
>> two worker nodes. But if we put this in a scala program and run it, then
>> all the data goes to only one node. We have run it multiple times, but this
>> behavior does not change. This seems strange.
>>
>> Is there some problem with the way we use HashPartitioner?
>>
>> Thanks in advance.
>>
>> Regards,
>> Raghava.
>>
>
>


-- 
Regards,
Raghava
http://raghavam.github.io


Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
If the data file is the same, then it should have a similar distribution of keys.
A few queries:

1. Did you compare the number of partitions in both cases?
2. Did you compare the resource allocation for the Spark shell vs. the Scala
program being submitted?

Also, can you please share the details of the Spark Context, Environment and
Executors when you run via the Scala program?

On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Hello All,
>
> We are using HashPartitioner in the following way on a 3 node cluster (1
> master and 2 worker nodes).
>
> val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
> Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
> x.toInt) } }).partitionBy(new HashPartitioner(8)).setName("u").persist()
>
> u.count()
>
> If we run this from the spark shell, the data (52 MB) is split across the
> two worker nodes. But if we put this in a scala program and run it, then
> all the data goes to only one node. We have run it multiple times, but this
> behavior does not change. This seems strange.
>
> Is there some problem with the way we use HashPartitioner?
>
> Thanks in advance.
>
> Regards,
> Raghava.
>


Fwd: [Help]:Strange Issue :Debug Spark Dataframe code

2016-04-17 Thread Divya Gehlot
Reposting again, as I am unable to find the root cause of where things are
going wrong.

Experts, please help.


-- Forwarded message --
From: Divya Gehlot 
Date: 15 April 2016 at 19:13
Subject: [Help]:Strange Issue :Debug Spark Dataframe code
To: "user @spark" 


Hi,
I am using Spark 1.5.2 with Scala 2.10.
Is there any other option apart from "explain(true)" to debug Spark
DataFrame code?

I am facing a strange issue.
I have a lookup DataFrame and I am using it to join another DataFrame on
different columns.

I am getting an *AnalysisException* on the third join.
When I checked the logical plan, it is using the same reference for the key,
but the references of the selected columns are changing.
For example:
df1 = COLUMN1#15,COLUMN2#16,COLUMN3#17

In the first two joins I get the same references and the join happens:
along with COLUMN1#15 I get COLUMN2#16 and COLUMN3#17.

But at the third join COLUMN1#15 stays the same while the other column
references have been updated to COLUMN2#167,COLUMN3#168.

It throws a Spark AnalysisException:

> org.apache.spark.sql.AnalysisException: resolved attribute(s) COLUMN1#15
> missing from

After the two joins, the DataFrame has more than 25 columns.

Could anybody help light the path by holding the torch?
Would really appreciate the help.

Thanks,
Divya
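
[Editor's sketch] One workaround often suggested for this class of
AnalysisException is to rename the reused lookup DataFrame's columns before
each join, so the plans do not end up carrying conflicting attribute
references. The helper, key names and suffixes below are hypothetical and not a
confirmed fix for this particular job.

import org.apache.spark.sql.DataFrame

// give the lookup side fresh, uniquely named columns before joining
def joinWithLookup(fact: DataFrame, lookup: DataFrame,
                   factKey: String, lookupKey: String, suffix: String): DataFrame = {
  val renamed = lookup.select(lookup.columns.map(c => lookup(c).as(c + suffix)): _*)
  fact.join(renamed, fact(factKey) === renamed(lookupKey + suffix))
}

// hypothetical usage: the same lookup joined three times on different keys
// val step1 = joinWithLookup(df, lookupDf, "KEY_A", "COLUMN1", "_j1")
// val step2 = joinWithLookup(step1, lookupDf, "KEY_B", "COLUMN1", "_j2")
// val step3 = joinWithLookup(step2, lookupDf, "KEY_C", "COLUMN1", "_j3")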


Re: WELCOME to user@spark.apache.org

2016-04-17 Thread Hyukjin Kwon
Hi Jinan,


There are some example test cases for XML here:
https://github.com/databricks/spark-xml/blob/master/src/test/java/com/databricks/spark/xml/JavaXmlSuite.java

Or, you can see the documentation in README.md:
https://github.com/databricks/spark-xml#java-api


There are other basic Java examples here:
https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples


The basic steps are explained well in the book Learning Spark (you can just
google it).


I also see this is explained well in the official documentation here:
http://spark.apache.org/docs/latest/programming-guide.html


I hope this can help.


Thanks!
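
[Editor's sketch] For completeness, a minimal Scala sketch of the spark-xml
usage the links above describe (the Java API mirrors it). It assumes the
com.databricks:spark-xml package is on the classpath, an existing SQLContext
`sqlContext`, and an input file books.xml whose records are wrapped in <book>
elements.

// read XML into a DataFrame; one row per <book> element
val books = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")        // element that marks one record
  .load("books.xml")

books.printSchema()
books.select("title", "author").show()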



2016-04-18 9:37 GMT+09:00 jinan_alhajjaj :

> Hello,
> I would like to know how to parse XML files using Apache spark by java
> language.  I am doing this for my senior project and I am a beginner in
> Apache Spark and I have just a little experience with spark.
>  Thank you.
> On Apr 18, 2016, at 3:14 AM, user-h...@spark.apache.org wrote:
>
> Hi! This is the ezmlm program. I'm managing the
> user@spark.apache.org mailing list.
>
> Acknowledgment: I have added the address
>
>   j.r.alhaj...@hotmail.com
>
> to the user mailing list.
>
> Welcome to user@spark.apache.org!
>
> Please save this message so that you know the address you are
> subscribed under, in case you later want to unsubscribe or change your
> subscription address.

Re: Apache Flink

2016-04-17 Thread Todd Nist
So there is an offering from Stratio, https://github.com/Stratio/Decision

> Decision CEP engine is a Complex Event Processing platform built on Spark
> Streaming.
>
> It is the result of combining the power of Spark Streaming as a continuous
> computing framework and Siddhi CEP engine as complex event processing
> engine.

https://stratio.atlassian.net/wiki/display/DECISION0x9/Home

I have not used it, only read about it but it may be of some interest to
you.

-Todd

On Sun, Apr 17, 2016 at 5:49 PM, Peyman Mohajerian 
wrote:

> Microbatching is certainly not a waste of time, you are making way too
> strong of an statement. In fact in certain cases one tuple at the time
> makes no sense, it all depends on the use cases. In fact if you understand
> the history of the project Storm you would know that microbatching was
> added later in Storm, Trident, and it is specifically for
> microbatching/windowing.
> In certain cases you are doing aggregation/windowing and throughput is the
> dominant design consideration and you don't care what each individual
> event/tuple does, e.g. of you push different event types to separate kafka
> topics and all you care is to do a count, what is the need for single event
> processing.
>
> On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet  wrote:
>
>> i have not been intrigued at all by the microbatching concept in Spark. I
>> am used to CEP in real streams processing environments like Infosphere
>> Streams & Storm where the granularity of processing is at the level of each
>> individual tuple and processing units (workers) can react immediately to
>> events being received and processed. The closest Spark streaming comes to
>> this concept is the notion of "state" that that can be updated via the
>> "updateStateBykey()" functions which are only able to be run in a
>> microbatch. Looking at the expected design changes to Spark Streaming in
>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>> the radar for Spark, though I have seen articles stating that more effort
>> is going to go into the Spark SQL layer in Spark streaming which may make
>> it more reminiscent of Esper.
>>
>> For these reasons, I have not even tried to implement CEP in Spark. I
>> feel it's a waste of time without immediate tuple-at-a-time processing.
>> Without this, they avoid the whole problem of "back pressure" (though keep
>> in mind, it is still very possible to overload the Spark streaming layer
>> with stages that will continue to pile up and never get worked off) but
>> they lose the granular control that you get in CEP environments by allowing
>> the rules & processors to react with the receipt of each tuple, right away.
>>
>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>> [1] on top of Apache Storm as an example of what such a design may look
>> like. It looks like Storm is going to be replaced in the not so distant
>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>> open source implementation as of yet.
>>
>> [1] https://github.com/calrissian/flowmix
>>
>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Corey,
>>>
>>> Can you please point me to docs on using Spark for CEP? Do we have a set
>>> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
>>> for Spark something like below
>>>
>>>
>>>
>>> ​
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>>>
 One thing I've noticed about Flink in my following of the project has
 been that it has established, in a few cases, some novel ideas and
 improvements over Spark. The problem with it, however, is that both the
 development team and the community around it are very small and many of
 those novel improvements have been rolled directly into Spark in subsequent
 versions. I was considering changing over my architecture to Flink at one
 point to get better, more real-time CEP streaming support, but in the end I
 decided to stick with Spark and just watch Flink continue to pressure it
 into improvement.

 On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
 wrote:

> i never found much info that flink was actually designed to be fault
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
> doesn't bode well for large scale data processing. spark was designed with
> fault tolerance in mind from the beginning.
>
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I 

Re: Apache Flink

2016-04-17 Thread Corey Nolet
Peyman,

I'm sorry, I missed the comment that microbatching was a waste of time. Did
someone mention this? I know this thread got pretty long so I may have
missed it somewhere above.

My comment about Spark's microbatching being a downside is strictly in
reference to CEP. Complex CEP flows are reactive, and the batched streaming
technique that Spark's architecture utilizes is not very easy for
programming real-time reactive designs. The thing is, many good streaming
engines start with just that, the streaming engine. They start at the core
with an architecture that generally promotes tuple-at-a-time processing.
Whatever they build on top of that is strictly just to make other use cases
easier to implement, hence the main difference between Flink and Spark.

Storm, Esper and InfoSphere Streams are three examples of this that come to
mind very quickly. All three of them are powerful tuple-at-a-time stream
processing engines under the hood, and all three of them also have abstractions
built on top of that core that make it easier to implement more specific
and more batch-oriented processing paradigms. Flink is similar to this.

I hope you didn't take my comment as an attack. My point is that Spark's
microbatching does not follow the traditional design at its core that most
well-accepted stream processing frameworks have had in the past. I am not
implying that microbatching is not useful in some use cases. What I am implying
is that it does not make real-time reactive environments very easy to implement.
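
[Editor's sketch] For readers following along, a minimal sketch of the
updateStateByKey-style state handling referred to earlier in this thread
(assuming Spark Streaming 1.5.x, a local master, and a socket source on
localhost:9999); it illustrates that state only advances once per microbatch
interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stateful-counts").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))  // state updates once per 1s batch
ssc.checkpoint("/tmp/streaming-ckpt")             // required by updateStateByKey

val events = ssc.socketTextStream("localhost", 9999)
val counts = events
  .map(event => (event, 1))
  .updateStateByKey[Int]((newValues: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newValues.sum))

counts.print()
ssc.start()
ssc.awaitTermination()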



On Sun, Apr 17, 2016 at 8:49 PM, Peyman Mohajerian 
wrote:

> Microbatching is certainly not a waste of time, you are making way too
> strong of an statement. In fact in certain cases one tuple at the time
> makes no sense, it all depends on the use cases. In fact if you understand
> the history of the project Storm you would know that microbatching was
> added later in Storm, Trident, and it is specifically for
> microbatching/windowing.
> In certain cases you are doing aggregation/windowing and throughput is the
> dominant design consideration and you don't care what each individual
> event/tuple does, e.g. of you push different event types to separate kafka
> topics and all you care is to do a count, what is the need for single event
> processing.
>
> On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet  wrote:
>
>> i have not been intrigued at all by the microbatching concept in Spark. I
>> am used to CEP in real streams processing environments like Infosphere
>> Streams & Storm where the granularity of processing is at the level of each
>> individual tuple and processing units (workers) can react immediately to
>> events being received and processed. The closest Spark streaming comes to
>> this concept is the notion of "state" that that can be updated via the
>> "updateStateBykey()" functions which are only able to be run in a
>> microbatch. Looking at the expected design changes to Spark Streaming in
>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>> the radar for Spark, though I have seen articles stating that more effort
>> is going to go into the Spark SQL layer in Spark streaming which may make
>> it more reminiscent of Esper.
>>
>> For these reasons, I have not even tried to implement CEP in Spark. I
>> feel it's a waste of time without immediate tuple-at-a-time processing.
>> Without this, they avoid the whole problem of "back pressure" (though keep
>> in mind, it is still very possible to overload the Spark streaming layer
>> with stages that will continue to pile up and never get worked off) but
>> they lose the granular control that you get in CEP environments by allowing
>> the rules & processors to react with the receipt of each tuple, right away.
>>
>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>> [1] on top of Apache Storm as an example of what such a design may look
>> like. It looks like Storm is going to be replaced in the not so distant
>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>> open source implementation as of yet.
>>
>> [1] https://github.com/calrissian/flowmix
>>
>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Corey,
>>>
>>> Can you please point me to docs on using Spark for CEP? Do we have a set
>>> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
>>> for Spark something like below
>>>
>>>
>>>
>>> ​
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>>>
 One thing I've noticed about Flink in my following of the project has
 been that it has established, in a few cases, some novel ideas and
 

Re: Apache Flink

2016-04-17 Thread Peyman Mohajerian
Microbatching is certainly not a waste of time; you are making way too
strong of a statement. In fact, in certain cases one tuple at a time
makes no sense; it all depends on the use case. If you understand
the history of the Storm project, you would know that microbatching was
added to Storm later, as Trident, and it is specifically for
microbatching/windowing.
In certain cases you are doing aggregation/windowing, throughput is the
dominant design consideration, and you don't care what each individual
event/tuple does. For example, if you push different event types to separate
Kafka topics and all you care about is a count, what is the need for
single-event processing?

On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet  wrote:

> i have not been intrigued at all by the microbatching concept in Spark. I
> am used to CEP in real streams processing environments like Infosphere
> Streams & Storm where the granularity of processing is at the level of each
> individual tuple and processing units (workers) can react immediately to
> events being received and processed. The closest Spark streaming comes to
> this concept is the notion of "state" that that can be updated via the
> "updateStateBykey()" functions which are only able to be run in a
> microbatch. Looking at the expected design changes to Spark Streaming in
> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
> the radar for Spark, though I have seen articles stating that more effort
> is going to go into the Spark SQL layer in Spark streaming which may make
> it more reminiscent of Esper.
>
> For these reasons, I have not even tried to implement CEP in Spark. I feel
> it's a waste of time without immediate tuple-at-a-time processing. Without
> this, they avoid the whole problem of "back pressure" (though keep in mind,
> it is still very possible to overload the Spark streaming layer with stages
> that will continue to pile up and never get worked off) but they lose the
> granular control that you get in CEP environments by allowing the rules &
> processors to react with the receipt of each tuple, right away.
>
> Awhile back, I did attempt to implement an InfoSphere Streams-like API [1]
> on top of Apache Storm as an example of what such a design may look like.
> It looks like Storm is going to be replaced in the not so distant future by
> Twitter's new design called Heron. IIRC, Heron does not have an open source
> implementation as of yet.
>
> [1] https://github.com/calrissian/flowmix
>
> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Corey,
>>
>> Can you please point me to docs on using Spark for CEP? Do we have a set
>> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
>> for Spark something like below
>>
>>
>>
>> ​
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>>
>>> One thing I've noticed about Flink in my following of the project has
>>> been that it has established, in a few cases, some novel ideas and
>>> improvements over Spark. The problem with it, however, is that both the
>>> development team and the community around it are very small and many of
>>> those novel improvements have been rolled directly into Spark in subsequent
>>> versions. I was considering changing over my architecture to Flink at one
>>> point to get better, more real-time CEP streaming support, but in the end I
>>> decided to stick with Spark and just watch Flink continue to pressure it
>>> into improvement.
>>>
>>> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
>>> wrote:
>>>
 i never found much info that flink was actually designed to be fault
 tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
 doesn't bode well for large scale data processing. spark was designed with
 fault tolerance in mind from the beginning.

 On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> I read the benchmark published by Yahoo. Obviously they already use
> Storm and inevitably very familiar with that tool. To start with although
> these benchmarks were somehow interesting IMO, it lend itself to an
> assurance that the tool chosen for their platform is still the best 
> choice.
> So inevitably the benchmarks and the tests were done to support
> primary their approach.
>
> In general anything which is not done through TCP Council or similar
> body is questionable..
> Their argument is that because Spark handles data streaming in micro
> batches then inevitably it introduces this in-built latency as per 

Re: WELCOME to user@spark.apache.org

2016-04-17 Thread jinan_alhajjaj
Hello,
I would like to know how to parse XML files using Apache Spark with the Java
language. I am doing this for my senior project; I am a beginner in Apache
Spark and have just a little experience with it.
Thank you.
On Apr 18, 2016, at 3:14 AM, user-h...@spark.apache.org wrote:

> Hi! This is the ezmlm program. I'm managing the
> user@spark.apache.org mailing list.
> 
> Acknowledgment: I have added the address
> 
>   j.r.alhaj...@hotmail.com
> 
> to the user mailing list.
> 
> Welcome to user@spark.apache.org!
> 
> Please save this message so that you know the address you are
> subscribed under, in case you later want to unsubscribe or change your
> subscription address.

Re: Apache Flink

2016-04-17 Thread Otis Gospodnetić
While Flink may not be younger than Spark, Spark came to Apache first,
which always helps.  Plus, there was already a lot of buzz around Spark
before it came to Apache.  Coming from Berkeley also helps.

That said, Flink seems decently healthy to me:
- http://search-hadoop.com/?fc_project=Flink_type=mail+_hash_+user=
- http://search-hadoop.com/?fc_project=Flink_type=mail+_hash_+dev=
-
http://search-hadoop.com/?fc_project=Flink_type=issue==144547200=146102400

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Sun, Apr 17, 2016 at 5:55 PM, Mich Talebzadeh 
wrote:

> Assuming that both Spark and Flink are contemporaries what are the reasons
> that Flink has not been adopted widely? (this may sound obvious and or
> prejudged). I mean Spark has surged in popularity in the past year if I am
> correct
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:49, Michael Malak  wrote:
>
>> In terms of publication date, a paper on Nephele was published in 2009,
>> prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of
>> Stratosphere, which became Flink.
>>
>>
>> --
>> *From:* Mark Hamstra 
>> *To:* Mich Talebzadeh 
>> *Cc:* Corey Nolet ; "user @spark" <
>> user@spark.apache.org>
>> *Sent:* Sunday, April 17, 2016 3:30 PM
>> *Subject:* Re: Apache Flink
>>
>> To be fair, the Stratosphere project from which Flink springs was started
>> as a collaborative university research project in Germany about the same
>> time that Spark was first released as Open Source, so they are near
>> contemporaries rather than Flink having been started only well after Spark
>> was an established and widely-used Apache project.
>>
>> On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> Also it always amazes me why they are so many tangential projects in Big
>> Data space? Would not it be easier if efforts were spent on adding to Spark
>> functionality rather than creating a new product like Flink?
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 21:08, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Corey for the useful info.
>>
>> I have used Sybase Aleri and StreamBase as commercial CEPs engines.
>> However, there does not seem to be anything close to these products in
>> Hadoop Ecosystem. So I guess there is nothing there?
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 17 April 2016 at 20:43, Corey Nolet  wrote:
>>
>> i have not been intrigued at all by the microbatching concept in Spark. I
>> am used to CEP in real streams processing environments like Infosphere
>> Streams & Storm where the granularity of processing is at the level of each
>> individual tuple and processing units (workers) can react immediately to
>> events being received and processed. The closest Spark streaming comes to
>> this concept is the notion of "state" that that can be updated via the
>> "updateStateBykey()" functions which are only able to be run in a
>> microbatch. Looking at the expected design changes to Spark Streaming in
>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>> the radar for Spark, though I have seen articles stating that more effort
>> is going to go into the Spark SQL layer in Spark streaming which may make
>> it more reminiscent of Esper.
>>
>> For these reasons, I have not even tried to implement CEP in Spark. I
>> feel it's a waste of time without immediate tuple-at-a-time processing.
>> Without this, they avoid the whole problem of "back pressure" (though keep
>> in mind, it is still very possible to overload the Spark streaming layer
>> with stages that will continue to pile up and never get worked off) but
>> they lose the granular control that you get in CEP environments by allowing
>> the rules & processors to react with the receipt of each tuple, right away.
>>
>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>> [1] on top of Apache Storm as an example of what such a design may look
>> like. It looks like Storm is going to be replaced in the 

strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Hello All,

We are using HashPartitioner in the following way on a 3 node cluster (1
master and 2 worker nodes).

val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
  .map[(Int, Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) } })
  .partitionBy(new HashPartitioner(8))
  .setName("u")
  .persist()

u.count()

If we run this from the Spark shell, the data (52 MB) is split across the
two worker nodes. But if we put this in a Scala program and run it, then
all the data goes to only one node. We have run it multiple times, but this
behavior does not change. This seems strange.
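
[Editor's sketch] One way to see where the partitions actually end up, assuming
the RDD `u` from the snippet above: each partition reports the hostname of the
executor that computed it.

import java.net.InetAddress

val perPartition = u.mapPartitionsWithIndex { (idx, it) =>
  Iterator((idx, InetAddress.getLocalHost.getHostName, it.size))
}.collect()

perPartition.foreach { case (idx, host, n) =>
  println(s"partition $idx computed on $host with $n records")
}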

Is there some problem with the way we use HashPartitioner?

Thanks in advance.

Regards,
Raghava.


Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Rajesh Balamohan
1. In the first case (i.e. the cluster where you have Hive and Spark), it would
have executed via HiveTableScan instead of OrcRelation. HiveTableScan would
not propagate any PPD-related information to the ORC readers (SPARK-12998). PPD
might not play a big role here, as your WHERE conditions seem to be only on the
event_date partition. Check whether the job launch itself is taking
time. In 1.6.x, HiveTableScan gets a thread dump for every reader;
SPARK-12898 might improve things if that is the case. Also, SPARK-12925 can
improve perf if you are using lots of String columns.

2. In the second case (i.e. the cluster where you have Spark alone), can you
check whether it is taking time to launch the job or is slow overall? You can
possibly check the timing from the Spark UI and compare it with the overall job
runtime.
- SPARK-12925, as mentioned in the first case, can also impact perf in case
you have lots of String columns.
- OrcRelation internally makes use of HadoopRDD. In case there are lots of
partitions created in your case, SPARK-14113 can have an impact, as closure
cleaning is called every time an RDD is created. This has the potential to
slow down the overall job runtime depending on the number of partitions.
- OrcRelation suffers from the callSite problem as well, so for every partition
it incurs the thread dump call, which can have an impact depending on the
number of partitions being created.
- Parquet caches lots of meta information, so the round trips to the NN are
fairly low in the case of Parquet. So when you run the query a second or third
time, Parquet queries tend to run in less time. OrcRelation has no such meta
caching and still incurs the NN cost.

~Rajesh.B
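
[Editor's sketch] For what it's worth, a sketch of the second setup (reading
the exported ORC files directly) with ORC predicate pushdown switched on;
spark.sql.orc.filterPushdown is off by default in 1.5.x, and per point 1 above
it may not help much when the filter is only on the partition column. The path
is a placeholder.

// assuming an existing SQLContext `sqlContext` on the Hive-less cluster
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val events = sqlContext.read.format("orc").load("/data/mytableFiles")
events.registerTempTable("myTable")

sqlContext.sql(
  """SELECT event_date, sum(bookings) AS bookings, sum(dealviews) AS dealviews
    |FROM myTable
    |WHERE event_date >= '2016-01-06' AND event_date <= '2016-04-02'
    |GROUP BY event_date
    |LIMIT 2""".stripMargin).show()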

On Mon, Apr 18, 2016 at 3:29 AM, Maurin Lenglart 
wrote:

>
> Let me explain a little my architecture:
> I have one cluster with hive and spark. Over there I create my databases
> and create the tables and insert data in them.
> If I execute this query  :self.sqlContext.sql(“SELECT `event_date` as
> `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews`
> FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2”) using ORC, it take 15 sec
>
> But then I export the tables to an other cluster where I don’t have hive.
> So I load my tables using
>
> sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
> Then this query: self.sqlContext.sql(“SELECT `event_date` as
> `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews`
> FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2”) take 50 seconds.
>
> If I do the same process using parquet tables self.sqlContext.sql(“SELECT
> `event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`)
> as `dealviews` FROM myTable WHERE  `event_date` >= '2016-01-06' AND
> `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2”) take 8
> seconds.
>
>
> thanks
>
> From: Mich Talebzadeh 
> Date: Sunday, April 17, 2016 at 2:52 PM
>
> To: maurin lenglart 
> Cc: "user @spark" 
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> hang on so it takes 15 seconds to switch the database context with 
> HiveContext.sql("use
> myDatabase") ?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:44, Maurin Lenglart  wrote:
>
>> The stats are only for one file in one partition. There is 17970737 rows
>> in total.
>> The table is not bucketed.
>>
>> The problem is not inserting rows, the problem is with this SQL query:
>>
>> “SELECT `event_date` as `event_date`,sum(`bookings`) as
>> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date`
>> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>> LIMIT 2”
>>
>> Benchmarks :
>>
>>- 8 seconds on parquet table loaded using
>> 
>> sqlContext.read.format(‘parquet').load(‘mytableFiles’).registerAsTable(‘myTable’)
>>- 50 seconds on ORC
>>using  
>> sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
>>- 15 seconds on ORC using sqlContext(‘use myDatabase’)
>>
>> The use case that I have is the second and slowest benchmark. Is there
>> something I can do to speed that up?
>>
>> thanks
>>
>>
>>
>> From: Mich Talebzadeh 
>> Date: Sunday, April 17, 2016 at 2:22 PM
>>
>> To: maurin lenglart 
>> Cc: "user @spark" 
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Hi Maurin,
>>
>> Have you tried to create your table in Hive as parquet table? This table
>> is pretty small with 100K rows.
>>

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
OK, that may explain it. In the other cluster you register it as a temp table
and then collect data using SQL against that temp table, which loads the data
at that point; if you do not have enough memory for your temp table, it will
have to spill to disk and do many passes. Could that be a plausible
explanation, without knowing the details?
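
[Editor's sketch] A sketch of that idea, assuming an existing SQLContext
`sqlContext` and that the table fits in executor memory: explicitly caching the
registered table keeps repeated SQL from rescanning the ORC files (the path and
names are placeholders).

sqlContext.read.format("orc").load("/data/mytableFiles").registerTempTable("myTable")
sqlContext.cacheTable("myTable")                       // lazy; materialized on first use
sqlContext.sql("SELECT count(*) FROM myTable").show()  // triggers the caching scan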

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:59, Maurin Lenglart  wrote:

>
> Let me explain a little my architecture:
> I have one cluster with hive and spark. Over there I create my databases
> and create the tables and insert data in them.
> If I execute this query  :self.sqlContext.sql(“SELECT `event_date` as
> `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews`
> FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2”) using ORC, it take 15 sec
>
> But then I export the tables to an other cluster where I don’t have hive.
> So I load my tables using
>
> sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
> Then this query: self.sqlContext.sql(“SELECT `event_date` as
> `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews`
> FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <=
> '2016-04-02' GROUP BY `event_date` LIMIT 2”) take 50 seconds.
>
> If I do the same process using parquet tables self.sqlContext.sql(“SELECT
> `event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`)
> as `dealviews` FROM myTable WHERE  `event_date` >= '2016-01-06' AND
> `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2”) take 8
> seconds.
>
>
> thanks
>
> From: Mich Talebzadeh 
> Date: Sunday, April 17, 2016 at 2:52 PM
>
> To: maurin lenglart 
> Cc: "user @spark" 
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> hang on so it takes 15 seconds to switch the database context with 
> HiveContext.sql("use
> myDatabase") ?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 22:44, Maurin Lenglart  wrote:
>
>> The stats are only for one file in one partition. There is 17970737 rows
>> in total.
>> The table is not bucketed.
>>
>> The problem is not inserting rows, the problem is with this SQL query:
>>
>> “SELECT `event_date` as `event_date`,sum(`bookings`) as
>> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date`
>> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
>> LIMIT 2”
>>
>> Benchmarks :
>>
>>- 8 seconds on parquet table loaded using
>> 
>> sqlContext.read.format(‘parquet').load(‘mytableFiles’).registerAsTable(‘myTable’)
>>- 50 seconds on ORC
>>using  
>> sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
>>- 15 seconds on ORC using sqlContext(‘use myDatabase’)
>>
>> The use case that I have is the second and slowest benchmark. Is there
>> something I can do to speed that up?
>>
>> thanks
>>
>>
>>
>> From: Mich Talebzadeh 
>> Date: Sunday, April 17, 2016 at 2:22 PM
>>
>> To: maurin lenglart 
>> Cc: "user @spark" 
>> Subject: Re: orc vs parquet aggregation, orc is really slow
>>
>> Hi Maurin,
>>
>> Have you tried to create your table in Hive as parquet table? This table
>> is pretty small with 100K rows.
>>
>> Is Hive table bucketed at all? I gather your issue is inserting rows into
>> Hive table at the moment that taking longer time (compared to Parquet)?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 21:43, Maurin Lenglart 
>> wrote:
>>
>>> Hi,
>>> I am using cloudera distribution, and when I do a" desc formatted table”
>>> I don t get all the table parameters.
>>>
>>> But I did a hive orcfiledump on one random file ( I replaced some of the
>>> values that can be sensible) :
>>> hive --orcfiledump
>>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
>>> 2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The
>>> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
>>> Continuing without it.
>>> Structure for 

Re: Apache Flink

2016-04-17 Thread Michael Malak
As with all history, "what if"s are not scientifically testable hypotheses, but my 
speculation is that the difference comes down to the energy (VCs, startups, big 
Internet companies, universities) within Silicon Valley, contrasted with Germany.

  From: Mich Talebzadeh 
 To: Michael Malak ; "user @spark" 
 
 Sent: Sunday, April 17, 2016 3:55 PM
 Subject: Re: Apache Flink
   
Assuming that both Spark and Flink are contemporaries what are the reasons that 
Flink has not been adopted widely? (this may sound obvious and or prejudged). I 
mean Spark has surged in popularity in the past year if I am correct
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 22:49, Michael Malak  wrote:

In terms of publication date, a paper on Nephele was published in 2009, prior 
to the 2010 USENIX paper on Spark. Nephele is the execution engine of 
Stratosphere, which became Flink.

  From: Mark Hamstra 
 To: Mich Talebzadeh  
Cc: Corey Nolet ; "user @spark" 
 Sent: Sunday, April 17, 2016 3:30 PM
 Subject: Re: Apache Flink
  
To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh  
wrote:

Also it always amazes me why they are so many tangential projects in Big Data 
space? Would not it be easier if efforts were spent on adding to Spark 
functionality rather than creating a new product like Flink?
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 21:08, Mich Talebzadeh  wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEPs engines. However, 
there does not seem to be anything close to these products in Hadoop Ecosystem. 
So I guess there is nothing there?
Regards.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 20:43, Corey Nolet  wrote:

i have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real streams processing environments like Infosphere Streams & 
Storm where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark streaming comes to this concept is 
the notion of "state" that that can be updated via the "updateStateBykey()" 
functions which are only able to be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. Without this, 
they avoid the whole problem of "back pressure" (though keep in mind, it is 
still very possible to overload the Spark streaming layer with stages that will 
continue to pile up and never get worked off) but they lose the granular 
control that you get in CEP environments by allowing the rules & processors to 
react with the receipt of each tuple, right away. 
Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
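
A minimal PySpark Streaming sketch of the updateStateByKey pattern discussed above,
assuming Spark 1.x APIs and an illustrative socket source; it shows how state only
advances once per microbatch rather than per individual tuple:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="state-sketch")        # illustrative app name
ssc = StreamingContext(sc, 5)                    # state updates once per 5s microbatch
ssc.checkpoint("/tmp/state-sketch-checkpoint")   # required for stateful operations

def update_count(new_values, running_count):
    # Invoked once per key per microbatch, not per incoming tuple.
    return sum(new_values) + (running_count or 0)

words = ssc.socketTextStream("localhost", 9999)  # illustrative source
counts = words.map(lambda w: (w, 1)).updateStateByKey(update_count)
counts.pprint()

ssc.start()
ssc.awaitTermination()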
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh  
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere. I am keen on getting hold of adaptor libraries for Spark 
something like below


​Thanks

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 16:07, Corey Nolet  wrote:

One thing I've noticed about Flink in my following of the project has been that 
it has established, in a few cases, some novel 

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart

Let me explain my architecture a little:
I have one cluster with Hive and Spark. There I create my databases, create the 
tables, and insert data into them.
If I execute this query: self.sqlContext.sql(“SELECT `event_date` as 
`event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM 
myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' 
GROUP BY `event_date` LIMIT 2”) using ORC, it takes 15 sec.

But then I export the tables to another cluster where I don’t have Hive, so I 
load my tables using
sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
Then this query: self.sqlContext.sql(“SELECT `event_date` as 
`event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM 
myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' 
GROUP BY `event_date` LIMIT 2”) takes 50 seconds.

If I do the same process using Parquet tables, self.sqlContext.sql(“SELECT 
`event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as 
`dealviews` FROM myTable WHERE  `event_date` >= '2016-01-06' AND `event_date` 
<= '2016-04-02' GROUP BY `event_date` LIMIT 2”) takes 8 seconds.


thanks

From: Mich Talebzadeh 
>
Date: Sunday, April 17, 2016 at 2:52 PM
To: maurin lenglart >
Cc: "user @spark" >
Subject: Re: orc vs parquet aggregation, orc is really slow

hang on so it takes 15 seconds to switch the database context with 
HiveContext.sql("use myDatabase") ?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:44, Maurin Lenglart 
> wrote:
The stats are only for one file in one partition. There is 17970737 rows in 
total.
The table is not bucketed.

The problem is not inserting rows, the problem is with this SQL query:

“SELECT `event_date` as `event_date`,sum(`bookings`) as 
`bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date` >= 
'2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2”

Benchmarks :

  *   8 seconds on parquet table loaded using  
sqlContext.read.format(‘parquet').load(‘mytableFiles’).registerAsTable(‘myTable’)
  *   50 seconds on ORC using  
sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
  *   15 seconds on ORC using sqlContext(‘use myDatabase’)

The use case that I have is the second and slowest benchmark. Is there 
something I can do to speed that up?

thanks



From: Mich Talebzadeh 
>
Date: Sunday, April 17, 2016 at 2:22 PM

To: maurin lenglart >
Cc: "user @spark" >
Subject: Re: orc vs parquet aggregation, orc is really slow

Hi Maurin,

Have you tried to create your table in Hive as parquet table? This table is 
pretty small with 100K rows.

Is Hive table bucketed at all? I gather your issue is inserting rows into Hive 
table at the moment that taking longer time (compared to Parquet)?

HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 21:43, Maurin Lenglart 
> wrote:
Hi,
I am using cloudera distribution, and when I do a" desc formatted table” I don 
t get all the table parameters.

But I did a hive orcfiledump on one random file ( I replaced some of the values 
that can be sensible) :
hive --orcfiledump 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The 
hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  
Continuing without it.
Structure for 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
File Version: 0.12 with HIVE_8732
16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1 with 
{include: null, offset: 0, length: 9223372036854775807}
Rows: 104260
Compression: ZLIB
Compression size: 262144
Type: struct

Stripe Statistics:
  Stripe 1:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
Column 5: count: 104260 hasNull: false min: XXX max: 

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Assuming that Spark and Flink are indeed contemporaries, what are the reasons
that Flink has not been as widely adopted? (This may sound obvious and/or
prejudged.) I mean, Spark has surged in popularity in the past year, if I am
correct.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:49, Michael Malak  wrote:

> In terms of publication date, a paper on Nephele was published in 2009,
> prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of
> Stratosphere, which became Flink.
>
>
> --
> *From:* Mark Hamstra 
> *To:* Mich Talebzadeh 
> *Cc:* Corey Nolet ; "user @spark" <
> user@spark.apache.org>
> *Sent:* Sunday, April 17, 2016 3:30 PM
> *Subject:* Re: Apache Flink
>
> To be fair, the Stratosphere project from which Flink springs was started
> as a collaborative university research project in Germany about the same
> time that Spark was first released as Open Source, so they are near
> contemporaries rather than Flink having been started only well after Spark
> was an established and widely-used Apache project.
>
> On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Also it always amazes me why they are so many tangential projects in Big
> Data space? Would not it be easier if efforts were spent on adding to Spark
> functionality rather than creating a new product like Flink?
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 17 April 2016 at 21:08, Mich Talebzadeh 
> wrote:
>
> Thanks Corey for the useful info.
>
> I have used Sybase Aleri and StreamBase as commercial CEPs engines.
> However, there does not seem to be anything close to these products in
> Hadoop Ecosystem. So I guess there is nothing there?
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 17 April 2016 at 20:43, Corey Nolet  wrote:
>
> i have not been intrigued at all by the microbatching concept in Spark. I
> am used to CEP in real streams processing environments like Infosphere
> Streams & Storm where the granularity of processing is at the level of each
> individual tuple and processing units (workers) can react immediately to
> events being received and processed. The closest Spark streaming comes to
> this concept is the notion of "state" that that can be updated via the
> "updateStateBykey()" functions which are only able to be run in a
> microbatch. Looking at the expected design changes to Spark Streaming in
> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
> the radar for Spark, though I have seen articles stating that more effort
> is going to go into the Spark SQL layer in Spark streaming which may make
> it more reminiscent of Esper.
>
> For these reasons, I have not even tried to implement CEP in Spark. I feel
> it's a waste of time without immediate tuple-at-a-time processing. Without
> this, they avoid the whole problem of "back pressure" (though keep in mind,
> it is still very possible to overload the Spark streaming layer with stages
> that will continue to pile up and never get worked off) but they lose the
> granular control that you get in CEP environments by allowing the rules &
> processors to react with the receipt of each tuple, right away.
>
> Awhile back, I did attempt to implement an InfoSphere Streams-like API [1]
> on top of Apache Storm as an example of what such a design may look like.
> It looks like Storm is going to be replaced in the not so distant future by
> Twitter's new design called Heron. IIRC, Heron does not have an open source
> implementation as of yet.
>
> [1] https://github.com/calrissian/flowmix
>
> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Hi Corey,
>
> Can you please point me to docs on using Spark for CEP? Do we have a set
> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
> for Spark something like below
>
>
>
> ​
> Thanks
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>

Re: Apache Flink

2016-04-17 Thread Michael Malak
There have been commercial CEP solutions for decades, including from my 
employer.

  From: Mich Talebzadeh 
 To: Mark Hamstra  
Cc: Corey Nolet ; "user @spark" 
 Sent: Sunday, April 17, 2016 3:48 PM
 Subject: Re: Apache Flink
   
The problem is that the strength and wider acceptance of a typical Open source 
project is its sizeable user and development community. When the community is 
small like Flink, then it is not a viable solution to adopt 
I am rather disappointed that no big data project can be used for Complex Event 
Processing as it has wider use in Algorithmic trading among others.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 22:30, Mark Hamstra  wrote:

To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh  
wrote:

Also it always amazes me why they are so many tangential projects in Big Data 
space? Would not it be easier if efforts were spent on adding to Spark 
functionality rather than creating a new product like Flink?
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 21:08, Mich Talebzadeh  wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEPs engines. However, 
there does not seem to be anything close to these products in Hadoop Ecosystem. 
So I guess there is nothing there?
Regards.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 20:43, Corey Nolet  wrote:

i have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real streams processing environments like Infosphere Streams & 
Storm where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark streaming comes to this concept is 
the notion of "state" that that can be updated via the "updateStateBykey()" 
functions which are only able to be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. Without this, 
they avoid the whole problem of "back pressure" (though keep in mind, it is 
still very possible to overload the Spark streaming layer with stages that will 
continue to pile up and never get worked off) but they lose the granular 
control that you get in CEP environments by allowing the rules & processors to 
react with the receipt of each tuple, right away. 
Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh  
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere. I am keen on getting hold of adaptor libraries for Spark 
something like below


​Thanks

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 16:07, Corey Nolet  wrote:

One thing I've noticed about Flink in my following of the project has been that 
it has established, in a few cases, some novel ideas and improvements over 
Spark. The problem with it, however, is that both the development team and the 
community around it are very small and many of those novel improvements have 
been rolled directly into Spark in subsequent versions. I was considering 
changing over my architecture to Flink at one point to get better, more 
real-time CEP streaming support, but in the end I 

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
Hang on, so it takes 15 seconds to switch the database context with
HiveContext.sql("use myDatabase")?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:44, Maurin Lenglart  wrote:

> The stats are only for one file in one partition. There is 17970737 rows
> in total.
> The table is not bucketed.
>
> The problem is not inserting rows, the problem is with this SQL query:
>
> “SELECT `event_date` as `event_date`,sum(`bookings`) as
> `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date`
> >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date`
> LIMIT 2”
>
> Benchmarks :
>
>- 8 seconds on parquet table loaded using
> 
> sqlContext.read.format(‘parquet').load(‘mytableFiles’).registerAsTable(‘myTable’)
>- 50 seconds on ORC
>using  
> sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
>- 15 seconds on ORC using sqlContext(‘use myDatabase’)
>
> The use case that I have is the second and slowest benchmark. Is there
> something I can do to speed that up?
>
> thanks
>
>
>
> From: Mich Talebzadeh 
> Date: Sunday, April 17, 2016 at 2:22 PM
>
> To: maurin lenglart 
> Cc: "user @spark" 
> Subject: Re: orc vs parquet aggregation, orc is really slow
>
> Hi Maurin,
>
> Have you tried to create your table in Hive as parquet table? This table
> is pretty small with 100K rows.
>
> Is Hive table bucketed at all? I gather your issue is inserting rows into
> Hive table at the moment that taking longer time (compared to Parquet)?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 21:43, Maurin Lenglart  wrote:
>
>> Hi,
>> I am using cloudera distribution, and when I do a" desc formatted table”
>> I don t get all the table parameters.
>>
>> But I did a hive orcfiledump on one random file ( I replaced some of the
>> values that can be sensible) :
>> hive --orcfiledump
>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
>> 2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The
>> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
>> Continuing without it.
>> Structure for /user/hive/warehouse/myDB.db/mytable
>> /event_date=2016-04-01/part-1
>> File Version: 0.12 with HIVE_8732
>> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
>> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
>> with {include: null, offset: 0, length: 9223372036854775807}
>> Rows: 104260
>> Compression: ZLIB
>> Compression size: 262144
>> Type: struct
>>
>> Stripe Statistics:
>>   Stripe 1:
>> Column 0: count: 104260 hasNull: false
>> Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>> Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>> Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
>> Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>> Column 5: count: 104260 hasNull: false min: XXX max: Others sum:
>> 262478
>> Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
>> Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
>> Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
>> Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum:
>> 730846
>> Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum:
>> 5481629
>> Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>> Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum:
>> 324522.0
>> Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
>> 6958348.122699987
>> Column 14: count: 104260 hasNull: false min: -2273.0 max:
>> 39930.13977860418 sum: 1546639.6964531767
>> Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>> Column 16: count: 104260 hasNull: false min: 0.0 max:
>> 4824.029119913681 sum: 45711.881143417035
>> Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>> Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>> Column 19: count: 104260 hasNull: false min: 2016-04-03 max:
>> 2016-04-03 sum: 1042600
>>
>> File Statistics:
>>   Column 0: count: 104260 hasNull: false
>>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>>   Column 3: 

Re: Apache Flink

2016-04-17 Thread Michael Malak
In terms of publication date, a paper on Nephele was published in 2009, prior 
to the 2010 USENIX paper on Spark. Nephele is the execution engine of 
Stratosphere, which became Flink.

  From: Mark Hamstra 
 To: Mich Talebzadeh  
Cc: Corey Nolet ; "user @spark" 
 Sent: Sunday, April 17, 2016 3:30 PM
 Subject: Re: Apache Flink
   
To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh  
wrote:

Also it always amazes me why they are so many tangential projects in Big Data 
space? Would not it be easier if efforts were spent on adding to Spark 
functionality rather than creating a new product like Flink?
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 21:08, Mich Talebzadeh  wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEPs engines. However, 
there does not seem to be anything close to these products in Hadoop Ecosystem. 
So I guess there is nothing there?
Regards.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 20:43, Corey Nolet  wrote:

i have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real streams processing environments like Infosphere Streams & 
Storm where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark streaming comes to this concept is 
the notion of "state" that that can be updated via the "updateStateBykey()" 
functions which are only able to be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. Without this, 
they avoid the whole problem of "back pressure" (though keep in mind, it is 
still very possible to overload the Spark streaming layer with stages that will 
continue to pile up and never get worked off) but they lose the granular 
control that you get in CEP environments by allowing the rules & processors to 
react with the receipt of each tuple, right away. 
Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh  
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere. I am keen on getting hold of adaptor libraries for Spark 
something like below


​Thanks

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 16:07, Corey Nolet  wrote:

One thing I've noticed about Flink in my following of the project has been that 
it has established, in a few cases, some novel ideas and improvements over 
Spark. The problem with it, however, is that both the development team and the 
community around it are very small and many of those novel improvements have 
been rolled directly into Spark in subsequent versions. I was considering 
changing over my architecture to Flink at one point to get better, more 
real-time CEP streaming support, but in the end I decided to stick with Spark 
and just watch Flink continue to pressure it into improvement.
On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers  wrote:

i never found much info that flink was actually designed to be fault tolerant. 
if fault tolerance is more bolt-on/add-on/afterthought then that doesn't bode 
well for large scale data processing. spark was designed with fault tolerance 
in mind from the beginning.

On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh 

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
The problem is that the strength and wider acceptance of a typical open-source
project comes from its sizeable user and development community. When the
community is small, as with Flink, it is not a viable solution to adopt.

I am rather disappointed that no big data project can be used for Complex
Event Processing, as it has wide use in algorithmic trading among other areas.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 22:30, Mark Hamstra  wrote:

> To be fair, the Stratosphere project from which Flink springs was started
> as a collaborative university research project in Germany about the same
> time that Spark was first released as Open Source, so they are near
> contemporaries rather than Flink having been started only well after Spark
> was an established and widely-used Apache project.
>
> On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Also it always amazes me why they are so many tangential projects in Big
>> Data space? Would not it be easier if efforts were spent on adding to Spark
>> functionality rather than creating a new product like Flink?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 21:08, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks Corey for the useful info.
>>>
>>> I have used Sybase Aleri and StreamBase as commercial CEPs engines.
>>> However, there does not seem to be anything close to these products in
>>> Hadoop Ecosystem. So I guess there is nothing there?
>>>
>>> Regards.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 20:43, Corey Nolet  wrote:
>>>
 i have not been intrigued at all by the microbatching concept in Spark.
 I am used to CEP in real streams processing environments like Infosphere
 Streams & Storm where the granularity of processing is at the level of each
 individual tuple and processing units (workers) can react immediately to
 events being received and processed. The closest Spark streaming comes to
 this concept is the notion of "state" that that can be updated via the
 "updateStateBykey()" functions which are only able to be run in a
 microbatch. Looking at the expected design changes to Spark Streaming in
 Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
 the radar for Spark, though I have seen articles stating that more effort
 is going to go into the Spark SQL layer in Spark streaming which may make
 it more reminiscent of Esper.

 For these reasons, I have not even tried to implement CEP in Spark. I
 feel it's a waste of time without immediate tuple-at-a-time processing.
 Without this, they avoid the whole problem of "back pressure" (though keep
 in mind, it is still very possible to overload the Spark streaming layer
 with stages that will continue to pile up and never get worked off) but
 they lose the granular control that you get in CEP environments by allowing
 the rules & processors to react with the receipt of each tuple, right away.

 Awhile back, I did attempt to implement an InfoSphere Streams-like API
 [1] on top of Apache Storm as an example of what such a design may look
 like. It looks like Storm is going to be replaced in the not so distant
 future by Twitter's new design called Heron. IIRC, Heron does not have an
 open source implementation as of yet.

 [1] https://github.com/calrissian/flowmix

 On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Corey,
>
> Can you please point me to docs on using Spark for CEP? Do we have a
> set of CEP libraries somewhere. I am keen on getting hold of adaptor
> libraries for Spark something like below
>
>
>
> ​
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>
>> One thing I've noticed 

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
The stats are only for one file in one partition. There are 17970737 rows in 
total.
The table is not bucketed.

The problem is not inserting rows; the problem is with this SQL query:

“SELECT `event_date` as `event_date`,sum(`bookings`) as 
`bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE  `event_date` >= 
'2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2”

Benchmarks:

  *   8 seconds on a Parquet table loaded using  
sqlContext.read.format(‘parquet').load(‘mytableFiles’).registerAsTable(‘myTable’)
  *   50 seconds on ORC using  
sqlContext.read.format(‘orc').load(‘mytableFiles’).registerAsTable(‘myTable’)
  *   15 seconds on ORC using sqlContext(‘use myDatabase’)

The use case that I have is the second and slowest benchmark. Is there 
something I can do to speed that up?

thanks
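
One hypothetical workaround sketch for the slowest path above, assuming the exported
ORC files can be rewritten once (paths are illustrative): convert the export to Parquet
and register the Parquet copy, which matches the fastest benchmark.

# Assumes an existing SQLContext `sqlContext` on the cluster without Hive.
df = sqlContext.read.format("orc").load("mytableFiles")

# One-off rewrite of the export as Parquet.
df.write.format("parquet").mode("overwrite").save("mytableFilesParquet")

# Register the Parquet copy and query it as before.
sqlContext.read.format("parquet").load("mytableFilesParquet").registerTempTable("myTable")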



From: Mich Talebzadeh 
>
Date: Sunday, April 17, 2016 at 2:22 PM
To: maurin lenglart >
Cc: "user @spark" >
Subject: Re: orc vs parquet aggregation, orc is really slow

Hi Maurin,

Have you tried to create your table in Hive as parquet table? This table is 
pretty small with 100K rows.

Is Hive table bucketed at all? I gather your issue is inserting rows into Hive 
table at the moment that taking longer time (compared to Parquet)?

HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 21:43, Maurin Lenglart 
> wrote:
Hi,
I am using cloudera distribution, and when I do a" desc formatted table” I don 
t get all the table parameters.

But I did a hive orcfiledump on one random file ( I replaced some of the values 
that can be sensible) :
hive --orcfiledump 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The 
hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  
Continuing without it.
Structure for 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
File Version: 0.12 with HIVE_8732
16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1 with 
{include: null, offset: 0, length: 9223372036854775807}
Rows: 104260
Compression: ZLIB
Compression size: 262144
Type: struct

Stripe Statistics:
  Stripe 1:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 
6958348.122699987
Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 
sum: 1546639.6964531767
Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 
sum: 45711.881143417035
Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 
sum: 1042600

File Statistics:
  Column 0: count: 104260 hasNull: false
  Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
  Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
  Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
  Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
  Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
  Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
  Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
  Column 8: count: 104260 hasNull: false min:  max: Travel sum: 1657073
  Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
  Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
  Column 11: count: 104260 hasNull: false 

Re: Apache Flink

2016-04-17 Thread Mark Hamstra
To be fair, the Stratosphere project from which Flink springs was started
as a collaborative university research project in Germany about the same
time that Spark was first released as Open Source, so they are near
contemporaries rather than Flink having been started only well after Spark
was an established and widely-used Apache project.

On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh 
wrote:

> Also it always amazes me why they are so many tangential projects in Big
> Data space? Would not it be easier if efforts were spent on adding to Spark
> functionality rather than creating a new product like Flink?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 21:08, Mich Talebzadeh 
> wrote:
>
>> Thanks Corey for the useful info.
>>
>> I have used Sybase Aleri and StreamBase as commercial CEPs engines.
>> However, there does not seem to be anything close to these products in
>> Hadoop Ecosystem. So I guess there is nothing there?
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 20:43, Corey Nolet  wrote:
>>
>>> i have not been intrigued at all by the microbatching concept in Spark.
>>> I am used to CEP in real streams processing environments like Infosphere
>>> Streams & Storm where the granularity of processing is at the level of each
>>> individual tuple and processing units (workers) can react immediately to
>>> events being received and processed. The closest Spark streaming comes to
>>> this concept is the notion of "state" that that can be updated via the
>>> "updateStateBykey()" functions which are only able to be run in a
>>> microbatch. Looking at the expected design changes to Spark Streaming in
>>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>>> the radar for Spark, though I have seen articles stating that more effort
>>> is going to go into the Spark SQL layer in Spark streaming which may make
>>> it more reminiscent of Esper.
>>>
>>> For these reasons, I have not even tried to implement CEP in Spark. I
>>> feel it's a waste of time without immediate tuple-at-a-time processing.
>>> Without this, they avoid the whole problem of "back pressure" (though keep
>>> in mind, it is still very possible to overload the Spark streaming layer
>>> with stages that will continue to pile up and never get worked off) but
>>> they lose the granular control that you get in CEP environments by allowing
>>> the rules & processors to react with the receipt of each tuple, right away.
>>>
>>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>>> [1] on top of Apache Storm as an example of what such a design may look
>>> like. It looks like Storm is going to be replaced in the not so distant
>>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>>> open source implementation as of yet.
>>>
>>> [1] https://github.com/calrissian/flowmix
>>>
>>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Corey,

 Can you please point me to docs on using Spark for CEP? Do we have a
 set of CEP libraries somewhere. I am keen on getting hold of adaptor
 libraries for Spark something like below



 ​
 Thanks


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 17 April 2016 at 16:07, Corey Nolet  wrote:

> One thing I've noticed about Flink in my following of the project has
> been that it has established, in a few cases, some novel ideas and
> improvements over Spark. The problem with it, however, is that both the
> development team and the community around it are very small and many of
> those novel improvements have been rolled directly into Spark in 
> subsequent
> versions. I was considering changing over my architecture to Flink at one
> point to get better, more real-time CEP streaming support, but in the end 
> I
> decided to stick with Spark and just watch Flink continue to pressure it
> into improvement.
>
> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
> wrote:
>
>> i never found much info that flink was actually designed to be fault

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Also, it always amazes me why there are so many tangential projects in the Big
Data space. Would it not be easier if efforts were spent on adding to Spark's
functionality rather than creating a new product like Flink?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 21:08, Mich Talebzadeh 
wrote:

> Thanks Corey for the useful info.
>
> I have used Sybase Aleri and StreamBase as commercial CEPs engines.
> However, there does not seem to be anything close to these products in
> Hadoop Ecosystem. So I guess there is nothing there?
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 20:43, Corey Nolet  wrote:
>
>> i have not been intrigued at all by the microbatching concept in Spark. I
>> am used to CEP in real streams processing environments like Infosphere
>> Streams & Storm where the granularity of processing is at the level of each
>> individual tuple and processing units (workers) can react immediately to
>> events being received and processed. The closest Spark streaming comes to
>> this concept is the notion of "state" that that can be updated via the
>> "updateStateBykey()" functions which are only able to be run in a
>> microbatch. Looking at the expected design changes to Spark Streaming in
>> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
>> the radar for Spark, though I have seen articles stating that more effort
>> is going to go into the Spark SQL layer in Spark streaming which may make
>> it more reminiscent of Esper.
>>
>> For these reasons, I have not even tried to implement CEP in Spark. I
>> feel it's a waste of time without immediate tuple-at-a-time processing.
>> Without this, they avoid the whole problem of "back pressure" (though keep
>> in mind, it is still very possible to overload the Spark streaming layer
>> with stages that will continue to pile up and never get worked off) but
>> they lose the granular control that you get in CEP environments by allowing
>> the rules & processors to react with the receipt of each tuple, right away.
>>
>> Awhile back, I did attempt to implement an InfoSphere Streams-like API
>> [1] on top of Apache Storm as an example of what such a design may look
>> like. It looks like Storm is going to be replaced in the not so distant
>> future by Twitter's new design called Heron. IIRC, Heron does not have an
>> open source implementation as of yet.
>>
>> [1] https://github.com/calrissian/flowmix
>>
>> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Corey,
>>>
>>> Can you please point me to docs on using Spark for CEP? Do we have a set
>>> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
>>> for Spark something like below
>>>
>>>
>>>
>>> ​
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>>>
 One thing I've noticed about Flink in my following of the project has
 been that it has established, in a few cases, some novel ideas and
 improvements over Spark. The problem with it, however, is that both the
 development team and the community around it are very small and many of
 those novel improvements have been rolled directly into Spark in subsequent
 versions. I was considering changing over my architecture to Flink at one
 point to get better, more real-time CEP streaming support, but in the end I
 decided to stick with Spark and just watch Flink continue to pressure it
 into improvement.

 On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
 wrote:

> i never found much info that flink was actually designed to be fault
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
> doesn't bode well for large scale data processing. spark was designed with
> fault tolerance in mind from the beginning.
>
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I read the benchmark published by Yahoo. Obviously they already use
>> Storm and inevitably very familiar with that tool. To start with although
>> these benchmarks were somehow interesting IMO, it lend itself to an

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
Hi Maurin,

Have you tried creating your table in Hive as a Parquet table? This table is
pretty small, with 100K rows.

Is the Hive table bucketed at all? I gather your issue is that inserting rows into
the Hive table is currently taking longer (compared to Parquet)?

HTH
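
A minimal sketch of that suggestion, assuming a HiveContext, an existing SparkContext
sc, and an illustrative target table name; it creates a Parquet-backed copy of the
table through the Hive metastore:

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)   # assumes an existing SparkContext `sc`
hiveContext.sql(
    "CREATE TABLE myTable_parquet STORED AS PARQUET "
    "AS SELECT * FROM myTable")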



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 21:43, Maurin Lenglart  wrote:

> Hi,
> I am using cloudera distribution, and when I do a" desc formatted table” I
> don t get all the table parameters.
>
> But I did a hive orcfiledump on one random file ( I replaced some of the
> values that can be sensible) :
> hive --orcfiledump
> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
> 2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The
> hbase-prefix-tree module jar containing PrefixTreeCodec is not present.
> Continuing without it.
> Structure for /user/hive/warehouse/myDB.db/mytable
> /event_date=2016-04-01/part-1
> File Version: 0.12 with HIVE_8732
> 16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from
> /user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
> with {include: null, offset: 0, length: 9223372036854775807}
> Rows: 104260
> Compression: ZLIB
> Compression size: 262144
> Type: struct
>
> Stripe Statistics:
>   Stripe 1:
> Column 0: count: 104260 hasNull: false
> Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
> Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
> Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
> Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
> Column 5: count: 104260 hasNull: false min: XXX max: Others sum:
> 262478
> Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
> Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
> Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
> Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum:
> 730846
> Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
> Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
> Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum:
> 324522.0
> Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
> 6958348.122699987
> Column 14: count: 104260 hasNull: false min: -2273.0 max:
> 39930.13977860418 sum: 1546639.6964531767
> Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
> Column 16: count: 104260 hasNull: false min: 0.0 max:
> 4824.029119913681 sum: 45711.881143417035
> Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
> Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
> Column 19: count: 104260 hasNull: false min: 2016-04-03 max:
> 2016-04-03 sum: 1042600
>
> File Statistics:
>   Column 0: count: 104260 hasNull: false
>   Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
>   Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
>   Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum:
> 738629
>   Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
>   Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
>   Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
>   Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
>   Column 8: count: 104260 hasNull: false min:  max: Travel sum: 1657073
>   Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
>   Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
>   Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
>   Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum:
> 324522.0
>   Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum:
> 6958348.122699987
>   Column 14: count: 104260 hasNull: false min: -2273.0 max:
> 39930.13977860418 sum: 1546639.6964531767
>   Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
>   Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681
> sum: 45711.881143417035
>   Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
>   Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
>   Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03
> sum: 1042600
>
> Stripes:
>   Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
> Stream: column 0 section ROW_INDEX start: 3 length 21
> Stream: column 1 section ROW_INDEX start: 24 length 148
> Stream: column 2 section ROW_INDEX start: 172 length 160
> Stream: column 3 section 

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
Hi,
I am using the Cloudera distribution, and when I do a "desc formatted table" I don't 
get all the table parameters.

But I did a hive --orcfiledump on one random file (I replaced some of the values 
that may be sensitive):
hive --orcfiledump 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
2016-04-17 01:36:15,424 WARN  [main] mapreduce.TableMapReduceUtil: The 
hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  
Continuing without it.
Structure for 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1
File Version: 0.12 with HIVE_8732
16/04/17 01:36:18 INFO orc.ReaderImpl: Reading ORC rows from 
/user/hive/warehouse/myDB.db/mytable/event_date=2016-04-01/part-1 with 
{include: null, offset: 0, length: 9223372036854775807}
Rows: 104260
Compression: ZLIB
Compression size: 262144
Type: struct

Stripe Statistics:
  Stripe 1:
Column 0: count: 104260 hasNull: false
Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
Column 3: count: 104260 hasNull: false min: XXX max: XXX sum: 738629
Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
Column 8: count: 104260 hasNull: false min:  max: XXX sum: 1657073
Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 
6958348.122699987
Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 
sum: 1546639.6964531767
Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 
sum: 45711.881143417035
Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 
sum: 1042600

File Statistics:
  Column 0: count: 104260 hasNull: false
  Column 1: count: 104260 hasNull: false min: XXX max: XXX sum: 365456
  Column 2: count: 104260 hasNull: false min: XXX max: XXX sum: 634322
  Column 3: count: 104260 hasNull: false min: XXX max: Unknown Utm sum: 738629
  Column 4: count: 104260 hasNull: false min: 0 max: 1 sum: 37221
  Column 5: count: 104260 hasNull: false min: XXX max: Others sum: 262478
  Column 6: count: 104260 hasNull: false min:  max: XXX sum: 220591
  Column 7: count: 104260 hasNull: false min:  max: XXX sum: 1102288
  Column 8: count: 104260 hasNull: false min:  max: Travel sum: 1657073
  Column 9: count: 104260 hasNull: false min: 1. XXX max: NULL sum: 730846
  Column 10: count: 104260 hasNull: false min: 0 max: 152963 sum: 5481629
  Column 11: count: 104260 hasNull: false min: 0 max: 45481 sum: 625946
  Column 12: count: 104260 hasNull: false min: 0.0 max: 40220.0 sum: 324522.0
  Column 13: count: 104260 hasNull: false min: 0.0 max: 201100.0 sum: 
6958348.122699987
  Column 14: count: 104260 hasNull: false min: -2273.0 max: 39930.13977860418 
sum: 1546639.6964531767
  Column 15: count: 104260 hasNull: false min: 0 max: 39269 sum: 233814
  Column 16: count: 104260 hasNull: false min: 0.0 max: 4824.029119913681 sum: 
45711.881143417035
  Column 17: count: 104260 hasNull: false min: 0 max: 46883 sum: 437866
  Column 18: count: 104260 hasNull: false min: 0 max: 14402 sum: 33864
  Column 19: count: 104260 hasNull: false min: 2016-04-03 max: 2016-04-03 sum: 
1042600

Stripes:
  Stripe: offset: 3 data: 909118 rows: 104260 tail: 325 index: 3665
Stream: column 0 section ROW_INDEX start: 3 length 21
Stream: column 1 section ROW_INDEX start: 24 length 148
Stream: column 2 section ROW_INDEX start: 172 length 160
Stream: column 3 section ROW_INDEX start: 332 length 168
Stream: column 4 section ROW_INDEX start: 500 length 133
Stream: column 5 section ROW_INDEX start: 633 length 152
Stream: column 6 section ROW_INDEX start: 785 length 141
Stream: column 7 section ROW_INDEX start: 926 length 165
Stream: column 8 section ROW_INDEX start: 1091 length 150
Stream: column 9 section ROW_INDEX start: 1241 length 160
Stream: column 10 section ROW_INDEX start: 1401 length 205
Stream: column 11 section ROW_INDEX start: 1606 length 200
Stream: column 12 section ROW_INDEX start: 1806 length 201
Stream: column 13 section ROW_INDEX start: 2007 length 292
Stream: column 14 section ROW_INDEX start: 2299 length 375
Stream: column 15 section ROW_INDEX start: 

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Thanks Corey for the useful info.

I have used Sybase Aleri and StreamBase as commercial CEP engines.
However, there does not seem to be anything close to these products in
the Hadoop ecosystem. So I guess there is nothing there?

Regards.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 20:43, Corey Nolet  wrote:

> i have not been intrigued at all by the microbatching concept in Spark. I
> am used to CEP in real streams processing environments like Infosphere
> Streams & Storm where the granularity of processing is at the level of each
> individual tuple and processing units (workers) can react immediately to
> events being received and processed. The closest Spark streaming comes to
> this concept is the notion of "state" that that can be updated via the
> "updateStateBykey()" functions which are only able to be run in a
> microbatch. Looking at the expected design changes to Spark Streaming in
> Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
> the radar for Spark, though I have seen articles stating that more effort
> is going to go into the Spark SQL layer in Spark streaming which may make
> it more reminiscent of Esper.
>
> For these reasons, I have not even tried to implement CEP in Spark. I feel
> it's a waste of time without immediate tuple-at-a-time processing. Without
> this, they avoid the whole problem of "back pressure" (though keep in mind,
> it is still very possible to overload the Spark streaming layer with stages
> that will continue to pile up and never get worked off) but they lose the
> granular control that you get in CEP environments by allowing the rules &
> processors to react with the receipt of each tuple, right away.
>
> Awhile back, I did attempt to implement an InfoSphere Streams-like API [1]
> on top of Apache Storm as an example of what such a design may look like.
> It looks like Storm is going to be replaced in the not so distant future by
> Twitter's new design called Heron. IIRC, Heron does not have an open source
> implementation as of yet.
>
> [1] https://github.com/calrissian/flowmix
>
> On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Corey,
>>
>> Can you please point me to docs on using Spark for CEP? Do we have a set
>> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
>> for Spark something like below
>>
>>
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>>
>>> One thing I've noticed about Flink in my following of the project has
>>> been that it has established, in a few cases, some novel ideas and
>>> improvements over Spark. The problem with it, however, is that both the
>>> development team and the community around it are very small and many of
>>> those novel improvements have been rolled directly into Spark in subsequent
>>> versions. I was considering changing over my architecture to Flink at one
>>> point to get better, more real-time CEP streaming support, but in the end I
>>> decided to stick with Spark and just watch Flink continue to pressure it
>>> into improvement.
>>>
>>> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
>>> wrote:
>>>
 i never found much info that flink was actually designed to be fault
 tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
 doesn't bode well for large scale data processing. spark was designed with
 fault tolerance in mind from the beginning.

 On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> I read the benchmark published by Yahoo. Obviously they already use
> Storm and inevitably very familiar with that tool. To start with although
> these benchmarks were somehow interesting IMO, it lend itself to an
> assurance that the tool chosen for their platform is still the best 
> choice.
> So inevitably the benchmarks and the tests were done to support
> primary their approach.
>
> In general anything which is not done through TCP Council or similar
> body is questionable..
> Their argument is that because Spark handles data streaming in micro
> batches then inevitably it introduces this in-built latency as per design.
> In contrast, both Storm and Flink do not (at the face value) have this
> issue.
>
> In addition as we already know Spark has far more capabilities
> compared to Flink (know 

Re: Apache Flink

2016-04-17 Thread Corey Nolet
i have not been intrigued at all by the microbatching concept in Spark. I
am used to CEP in real streams processing environments like Infosphere
Streams & Storm where the granularity of processing is at the level of each
individual tuple and processing units (workers) can react immediately to
events being received and processed. The closest Spark Streaming comes to
this concept is the notion of "state" that can be updated via the
"updateStateByKey()" functions, which are only able to be run once per
microbatch. Looking at the expected design changes to Spark Streaming in
Spark 2.0.0, it also does not look like tuple-at-a-time processing is on
the radar for Spark, though I have seen articles stating that more effort
is going to go into the Spark SQL layer in Spark streaming which may make
it more reminiscent of Esper.
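
To make the point concrete, a minimal sketch of that micro-batch state update; the
socket source, batch interval, and checkpoint path below are placeholders, not
anything from this thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // State is only updated once per micro-batch, never per individual tuple.
    val conf = new SparkConf().setAppName("state-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))   // batch interval = update granularity
    ssc.checkpoint("/tmp/state-sketch-checkpoint")      // required by updateStateByKey

    val pairs = ssc.socketTextStream("localhost", 9999) // placeholder source
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // The update function sees all new values for a key in the current batch plus
    // the previous state, and it only runs when the batch fires.
    val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()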

For these reasons, I have not even tried to implement CEP in Spark. I feel
it's a waste of time without immediate tuple-at-a-time processing. Without
this, they avoid the whole problem of "back pressure" (though keep in mind,
it is still very possible to overload the Spark streaming layer with stages
that will continue to pile up and never get worked off) but they lose the
granular control that you get in CEP environments by allowing the rules &
processors to react with the receipt of each tuple, right away.

A while back, I did attempt to implement an InfoSphere Streams-like API [1]
on top of Apache Storm as an example of what such a design may look like.
It looks like Storm is going to be replaced in the not so distant future by
Twitter's new design called Heron. IIRC, Heron does not have an open source
implementation as of yet.

[1] https://github.com/calrissian/flowmix

On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh 
wrote:

> Hi Corey,
>
> Can you please point me to docs on using Spark for CEP? Do we have a set
> of CEP libraries somewhere. I am keen on getting hold of adaptor libraries
> for Spark something like below
>
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 16:07, Corey Nolet  wrote:
>
>> One thing I've noticed about Flink in my following of the project has
>> been that it has established, in a few cases, some novel ideas and
>> improvements over Spark. The problem with it, however, is that both the
>> development team and the community around it are very small and many of
>> those novel improvements have been rolled directly into Spark in subsequent
>> versions. I was considering changing over my architecture to Flink at one
>> point to get better, more real-time CEP streaming support, but in the end I
>> decided to stick with Spark and just watch Flink continue to pressure it
>> into improvement.
>>
>> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers 
>> wrote:
>>
>>> i never found much info that flink was actually designed to be fault
>>> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
>>> doesn't bode well for large scale data processing. spark was designed with
>>> fault tolerance in mind from the beginning.
>>>
>>> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 I read the benchmark published by Yahoo. Obviously they already use
 Storm and inevitably very familiar with that tool. To start with although
 these benchmarks were somehow interesting IMO, it lend itself to an
 assurance that the tool chosen for their platform is still the best choice.
 So inevitably the benchmarks and the tests were done to support
 primary their approach.

 In general anything which is not done through TCP Council or similar
 body is questionable..
 Their argument is that because Spark handles data streaming in micro
 batches then inevitably it introduces this in-built latency as per design.
 In contrast, both Storm and Flink do not (at the face value) have this
 issue.

 In addition as we already know Spark has far more capabilities compared
 to Flink (know nothing about Storm). So really it boils down to the
 business SLA to choose which tool one wants to deploy for your use case.
 IMO Spark micro batching approach is probably OK for 99% of use cases. If
 we had in built libraries for CEP for Spark (I am searching for it), I
 would not bother with Flink.

 HTH


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 17 April 2016 at 12:47, 

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Hi Corey,

Can you please point me to docs on using Spark for CEP? Do we have a set of
CEP libraries somewhere? I am keen on getting hold of adaptor libraries for
Spark, something like the below.



Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 16:07, Corey Nolet  wrote:

> One thing I've noticed about Flink in my following of the project has been
> that it has established, in a few cases, some novel ideas and improvements
> over Spark. The problem with it, however, is that both the development team
> and the community around it are very small and many of those novel
> improvements have been rolled directly into Spark in subsequent versions. I
> was considering changing over my architecture to Flink at one point to get
> better, more real-time CEP streaming support, but in the end I decided to
> stick with Spark and just watch Flink continue to pressure it into
> improvement.
>
> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers  wrote:
>
>> i never found much info that flink was actually designed to be fault
>> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
>> doesn't bode well for large scale data processing. spark was designed with
>> fault tolerance in mind from the beginning.
>>
>> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I read the benchmark published by Yahoo. Obviously they already use
>>> Storm and inevitably very familiar with that tool. To start with although
>>> these benchmarks were somehow interesting IMO, it lend itself to an
>>> assurance that the tool chosen for their platform is still the best choice.
>>> So inevitably the benchmarks and the tests were done to support
>>> primary their approach.
>>>
>>> In general anything which is not done through TCP Council or similar
>>> body is questionable..
>>> Their argument is that because Spark handles data streaming in micro
>>> batches then inevitably it introduces this in-built latency as per design.
>>> In contrast, both Storm and Flink do not (at the face value) have this
>>> issue.
>>>
>>> In addition as we already know Spark has far more capabilities compared
>>> to Flink (know nothing about Storm). So really it boils down to the
>>> business SLA to choose which tool one wants to deploy for your use case.
>>> IMO Spark micro batching approach is probably OK for 99% of use cases. If
>>> we had in built libraries for CEP for Spark (I am searching for it), I
>>> would not bother with Flink.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.ma...@inria.fr> wrote:
>>>
 You probably read this benchmark at Yahoo, any comments from Spark?

 https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at


 On 17 Apr 2016, at 12:41, andy petrella 
 wrote:

 Just adding one thing to the mix: `that the latency for streaming data
 is eliminated` is insane :-D

 On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

>  It seems that Flink argues that the latency for streaming data is
> eliminated whereas with Spark RDD there is this latency.
>
> I noticed that Flink does not support interactive shell much like
> Spark shell where you can add jars to it to do kafka testing. The advice
> was to add the streaming Kafka jar file to CLASSPATH but that does not 
> work.
>
> Most Flink documentation also rather sparce with the usual example of
> word count which is not exactly what you want.
>
> Anyway I will have a look at it further. I have a Spark Scala
> streaming Kafka program that works fine in Spark and I want to recode it
> using Scala for Flink with Kafka but have difficulty importing and testing
> libraries.
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 02:41, Ascot Moss  wrote:
>
>> I compared both last month, seems to me that Flink's MLLib is not yet
>> ready.
>>
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
>> 

Re: Spark support for Complex Event Processing (CEP)

2016-04-17 Thread Mich Talebzadeh
Thanks Luciano. Appreciated.

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 17:32, Luciano Resende  wrote:

> Hi Mitch,
>
> I know some folks that were investigating/prototyping on this area, let me
> see if I can get them to reply here with more details.
>
> On Sun, Apr 17, 2016 at 1:54 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Has Spark got libraries for CEP using Spark Streaming with Kafka by any
>> chance?
>>
>> I am looking at Flink that supposed to have these libraries for CEP but I
>> find Flink itself very much work in progress.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark support for Complex Event Processing (CEP)

2016-04-17 Thread Luciano Resende
Hi Mitch,

I know some folks who were investigating/prototyping in this area; let me
see if I can get them to reply here with more details.

On Sun, Apr 17, 2016 at 1:54 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Has Spark got libraries for CEP using Spark Streaming with Kafka by any
> chance?
>
> I am looking at Flink that supposed to have these libraries for CEP but I
> find Flink itself very much work in progress.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
The streaming use case is important IMO, as Spark (like Flink) advocates for 
the unification of analytics tools: batch and graph processing, SQL, ML, and 
streaming all in one.

> On 17 Apr 2016, at 17:07, Corey Nolet  wrote:
> 
> One thing I've noticed about Flink in my following of the project has been 
> that it has established, in a few cases, some novel ideas and improvements 
> over Spark. The problem with it, however, is that both the development team 
> and the community around it are very small and many of those novel 
> improvements have been rolled directly into Spark in subsequent versions. I 
> was considering changing over my architecture to Flink at one point to get 
> better, more real-time CEP streaming support, but in the end I decided to 
> stick with Spark and just watch Flink continue to pressure it into 
> improvement.
> 
> On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers  > wrote:
> i never found much info that flink was actually designed to be fault 
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that 
> doesn't bode well for large scale data processing. spark was designed with 
> fault tolerance in mind from the beginning.
> 
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh  > wrote:
> Hi,
> 
> I read the benchmark published by Yahoo. Obviously they already use Storm and 
> inevitably very familiar with that tool. To start with although these 
> benchmarks were somehow interesting IMO, it lend itself to an assurance that 
> the tool chosen for their platform is still the best choice. So inevitably 
> the benchmarks and the tests were done to support primary their approach.
> 
> In general anything which is not done through TCP Council or similar body is 
> questionable..
> Their argument is that because Spark handles data streaming in micro batches 
> then inevitably it introduces this in-built latency as per design. In 
> contrast, both Storm and Flink do not (at the face value) have this issue.
> 
> In addition as we already know Spark has far more capabilities compared to 
> Flink (know nothing about Storm). So really it boils down to the business SLA 
> to choose which tool one wants to deploy for your use case. IMO Spark micro 
> batching approach is probably OK for 99% of use cases. If we had in built 
> libraries for CEP for Spark (I am searching for it), I would not bother with 
> Flink.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
> > 
> wrote:
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> 
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella > > wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh > > wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss > > wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh > > wrote:
>> Thanks Ted. I was 

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
For the streaming case Flink is fault tolerant (DataStream API); for the batch 
case (DataSet API) it is not yet, as far as I can tell from my research on their platform.

> On 17 Apr 2016, at 17:03, Koert Kuipers  wrote:
> 
> i never found much info that flink was actually designed to be fault 
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that 
> doesn't bode well for large scale data processing. spark was designed with 
> fault tolerance in mind from the beginning.
> 
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh  > wrote:
> Hi,
> 
> I read the benchmark published by Yahoo. Obviously they already use Storm and 
> inevitably very familiar with that tool. To start with although these 
> benchmarks were somehow interesting IMO, it lend itself to an assurance that 
> the tool chosen for their platform is still the best choice. So inevitably 
> the benchmarks and the tests were done to support primary their approach.
> 
> In general anything which is not done through TCP Council or similar body is 
> questionable..
> Their argument is that because Spark handles data streaming in micro batches 
> then inevitably it introduces this in-built latency as per design. In 
> contrast, both Storm and Flink do not (at the face value) have this issue.
> 
> In addition as we already know Spark has far more capabilities compared to 
> Flink (know nothing about Storm). So really it boils down to the business SLA 
> to choose which tool one wants to deploy for your use case. IMO Spark micro 
> batching approach is probably OK for 99% of use cases. If we had in built 
> libraries for CEP for Spark (I am searching for it), I would not bother with 
> Flink.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
> > 
> wrote:
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> 
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella > > wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh > > wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss > > wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh > > wrote:
>> Thanks Ted. I was wondering if someone is using both :)
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 16 April 2016 at 17:08, Ted Yu > > wrote:
>> Looks like this question is more relevant on flink mailing list :-)
>> 
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh > > wrote:
>> Hi,
>> 
>> Has anyone used Apache Flink instead of Spark by any chance
>> 
>> I am interested in its set of libraries 

Docker Mesos Spark Port Mapping

2016-04-17 Thread John Omernik
The setting

spark.mesos.executor.docker.portmaps

Is interesting to me, without this setting, the docker executor uses
net=host and thus port mappings are not needed.

With this setting (just adding some random mappings), my executors fail
with less than helpful messages.

I guess some questions here


1. If I specify port mappings, is there an "implied" net=bridge applied to
my executors? Since they are failing fast, I really can't see the
command to determine what the net setting is.

2. If 1 is true, then what port mappings do I need to run this in net=bridge
mode? This is actually preferred for me: I am using MapR FS, and it
doesn't seem to like running a MapR client (which Spark is using to access
the filesystem) in a Docker container on a MapR FS node where net =
host. I can do it where net=bridge and that works for MapR, but not
net=host, and thus I am hoping I have some options for net=bridge.
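
For reference, a minimal sketch of the kind of configuration being asked about; the
master URL, image name, and port mappings below are made-up placeholders, and whether
setting port mappings implies bridged networking is exactly the open question:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("mesos-docker-bridge-test")
      .setMaster("mesos://zk://mesos-master:2181/mesos")                    // placeholder master URL
      .set("spark.mesos.executor.docker.image", "example/spark-mapr:1.5.2") // placeholder image
      // Comma-separated host_port:container_port[:tcp|:udp] mappings; values are illustrative.
      .set("spark.mesos.executor.docker.portmaps", "31000:31000:tcp,31001:31001:tcp")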

Thoughts?


John


Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
Hi Mich,

IMO one will try to see if there is an alternative, a better one at least.
This benchmark could be a good starting point.

Best,
Ovidiu
> On 17 Apr 2016, at 15:52, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I read the benchmark published by Yahoo. Obviously they already use Storm and 
> inevitably very familiar with that tool. To start with although these 
> benchmarks were somehow interesting IMO, it lend itself to an assurance that 
> the tool chosen for their platform is still the best choice. So inevitably 
> the benchmarks and the tests were done to support primary their approach.
> 
> In general anything which is not done through TCP Council or similar body is 
> questionable..
> Their argument is that because Spark handles data streaming in micro batches 
> then inevitably it introduces this in-built latency as per design. In 
> contrast, both Storm and Flink do not (at the face value) have this issue.
> 
> In addition as we already know Spark has far more capabilities compared to 
> Flink (know nothing about Storm). So really it boils down to the business SLA 
> to choose which tool one wants to deploy for your use case. IMO Spark micro 
> batching approach is probably OK for 99% of use cases. If we had in built 
> libraries for CEP for Spark (I am searching for it), I would not bother with 
> Flink.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
> > 
> wrote:
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> 
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella > > wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh > > wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss > > wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh > > wrote:
>> Thanks Ted. I was wondering if someone is using both :)
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 16 April 2016 at 17:08, Ted Yu > > wrote:
>> Looks like this question is more relevant on flink mailing list :-)
>> 
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh > > wrote:
>> Hi,
>> 
>> Has anyone used Apache Flink instead of Spark by any chance
>> 
>> I am interested in its set of libraries for Complex Event Processing.
>> 
>> Frankly I don't know if it offers far more than Spark offers.
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
Yes, mostly regarding Spark partitioning and the use of groupByKey instead of 
reduceByKey.
However, Flink extended the benchmark here 
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ 

So I was curious about an answer from the Spark team: do they plan to do 
something similar?
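
For context, a minimal sketch of the difference being discussed, using a toy
per-key count (the data is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("agg-sketch").setMaster("local[*]"))
    val events = sc.parallelize(Seq(("ad1", 1), ("ad2", 1), ("ad1", 1)))

    // groupByKey shuffles every individual value across the network before summing.
    val viaGroup  = events.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so far less data is shuffled.
    val viaReduce = events.reduceByKey(_ + _)

    viaGroup.collect().foreach(println)
    viaReduce.collect().foreach(println)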

> On 17 Apr 2016, at 15:33, Silvio Fiorito  
> wrote:
> 
> Actually there were multiple responses to it on the GitHub project, including 
> a PR to improve the Spark code, but they weren’t acknowledged.
>  
>  
> From: Ovidiu-Cristian MARCU 
> Sent: Sunday, April 17, 2016 7:48 AM
> To: andy petrella 
> Cc: Mich Talebzadeh ; Ascot Moss 
> ; Ted Yu ; user 
> @spark 
> Subject: Re: Apache Flink
>  
> You probably read this benchmark at Yahoo, any comments from Spark?
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>  
> 
> 
> 
>> On 17 Apr 2016, at 12:41, andy petrella > > wrote:
>> 
>> Just adding one thing to the mix: `that the latency for streaming data is 
>> eliminated` is insane :-D
>> 
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh > > wrote:
>>  It seems that Flink argues that the latency for streaming data is 
>> eliminated whereas with Spark RDD there is this latency.
>> 
>> I noticed that Flink does not support interactive shell much like Spark 
>> shell where you can add jars to it to do kafka testing. The advice was to 
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>> 
>> Most Flink documentation also rather sparce with the usual example of word 
>> count which is not exactly what you want.
>> 
>> Anyway I will have a look at it further. I have a Spark Scala streaming 
>> Kafka program that works fine in Spark and I want to recode it using Scala 
>> for Flink with Kafka but have difficulty importing and testing libraries.
>> 
>> Cheers
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 17 April 2016 at 02:41, Ascot Moss > > wrote:
>> I compared both last month, seems to me that Flink's MLLib is not yet ready.
>> 
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh > > wrote:
>> Thanks Ted. I was wondering if someone is using both :)
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 16 April 2016 at 17:08, Ted Yu > > wrote:
>> Looks like this question is more relevant on flink mailing list :-)
>> 
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh > > wrote:
>> Hi,
>> 
>> Has anyone used Apache Flink instead of Spark by any chance
>> 
>> I am interested in its set of libraries for Complex Event Processing.
>> 
>> Frankly I don't know if it offers far more than Spark offers.
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> 
>> 
>> 
>> -- 
>> andy



Re: JSON Usage

2016-04-17 Thread Benjamin Kim
Hyukjin,

This is what I did so far. I haven't used Datasets yet; maybe I don't need to.

var df: DataFrame = null
for (message <- messages) {                          // SQS messages fetched earlier
  val bodyRdd = sc.parallelize(message.getBody() :: Nil)
  val fileDf = sqlContext.read.json(bodyRdd)         // infer the nested schema per message
    .select(
      $"Records.s3.bucket.name".as("bucket"),
      $"Records.s3.object.key".as("key")
    )
  if (df != null) {
    df = df.unionAll(fileDf)                         // accumulate a single DataFrame
  } else {
    df = fileDf
  }
}
df.show

Each result is returned as an array. I just need to concatenate the bucket and 
key values to form each S3 URL and then download the file at that URL. This is 
the part I need help with next.
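
A rough sketch of that next step, assuming each element of Records carries one
bucket/key pair; the s3:// URL format and the final download step are assumptions
for illustration:

    import org.apache.spark.sql.functions.{concat, explode, lit}
    import sqlContext.implicits._

    // Starting from the raw per-message DataFrame (before the select above),
    // flatten the Records array and build one S3 URL string per record.
    val raw  = sqlContext.read.json(bodyRdd)
    val urls = raw
      .select(explode($"Records").as("r"))
      .select(concat(lit("s3://"), $"r.s3.bucket.name", lit("/"), $"r.s3.object.key").as("url"))
      .map(_.getString(0))
      .collect()

    // Each URL could then be fetched with the AWS SDK or any S3/HTTP client.
    urls.foreach(println)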

Thanks,
Ben

> On Apr 17, 2016, at 7:38 AM, Hyukjin Kwon  wrote:
> 
> Hi!
> 
> Personally, I don't think it necessarily needs to be DataSet for your goal.
> 
> Just select your data at "s3" from DataFrame loaded by sqlContext.read.json().
> 
> You can try to printSchema() to check the nested schema and then select the 
> data.
> 
> Also, I guess (from your codes) you are trying to send a reauest, fetch the 
> response to driver-side, and then send each message to executor-side. I guess 
> there would be really heavy overhead in driver-side.
> Holden,
> 
> If I were to use DataSets, then I would essentially do this:
> 
> val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
> val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
> for (message <- messages.asScala) {
> val files = sqlContext.read.json(message.getBody())
> }
> 
> Can I simply do files.toDS() or do I have to create a schema using a case 
> class File and apply it as[File]? If I have to apply a schema, then how would 
> I create it based on the JSON structure below, especially the nested elements.
> 
> Thanks,
> Ben
> 
> 
>> On Apr 14, 2016, at 3:46 PM, Holden Karau > > wrote:
>> 
>> You could certainly use RDDs for that, you might also find using Dataset 
>> selecting the fields you need to construct the URL to fetch and then using 
>> the map function to be easier.
>> 
>> On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim > > wrote:
>> I was wonder what would be the best way to use JSON in Spark/Scala. I need 
>> to lookup values of fields in a collection of records to form a URL and 
>> download that file at that location. I was thinking an RDD would be perfect 
>> for this. I just want to hear from others who might have more experience in 
>> this. Below is the actual JSON structure that I am trying to use for the S3 
>> bucket and key values of each “record" within “Records".
>> 
>> {
>>"Records":[
>>   {
>>  "eventVersion":"2.0",
>>  "eventSource":"aws:s3",
>>  "awsRegion":"us-east-1",
>>  "eventTime":The time, in ISO-8601 format, for example, 
>> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>>  "eventName":"event-type",
>>  "userIdentity":{
>> 
>> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>>  },
>>  "requestParameters":{
>> "sourceIPAddress":"ip-address-where-request-came-from"
>>  },
>>  "responseElements":{
>> "x-amz-request-id":"Amazon S3 generated request ID",
>> "x-amz-id-2":"Amazon S3 host that processed the request"
>>  },
>>  "s3":{
>> "s3SchemaVersion":"1.0",
>> "configurationId":"ID found in the bucket notification 
>> configuration",
>> "bucket":{
>>"name":"bucket-name",
>>"ownerIdentity":{
>>   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>>},
>>"arn":"bucket-ARN"
>> },
>> "object":{
>>"key":"object-key",
>>"size":object-size,
>>"eTag":"object eTag",
>>"versionId":"object version if bucket is versioning-enabled, 
>> otherwise null",
>>"sequencer": "a string representation of a hexadecimal value 
>> used to determine event sequence,
>>only used with PUTs and DELETEs"
>> }
>>  }
>>   },
>>   {
>>   // Additional events
>>   }
>>]
>> }
>> 
>> Thanks
>> Ben
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau 



Re: Apache Flink

2016-04-17 Thread Corey Nolet
One thing I've noticed about Flink in my following of the project has been
that it has established, in a few cases, some novel ideas and improvements
over Spark. The problem with it, however, is that both the development team
and the community around it are very small and many of those novel
improvements have been rolled directly into Spark in subsequent versions. I
was considering changing over my architecture to Flink at one point to get
better, more real-time CEP streaming support, but in the end I decided to
stick with Spark and just watch Flink continue to pressure it into
improvement.

On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers  wrote:

> i never found much info that flink was actually designed to be fault
> tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that
> doesn't bode well for large scale data processing. spark was designed with
> fault tolerance in mind from the beginning.
>
> On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I read the benchmark published by Yahoo. Obviously they already use Storm
>> and inevitably very familiar with that tool. To start with although these
>> benchmarks were somehow interesting IMO, it lend itself to an assurance
>> that the tool chosen for their platform is still the best choice. So
>> inevitably the benchmarks and the tests were done to support primary their
>> approach.
>>
>> In general anything which is not done through TCP Council or similar body
>> is questionable..
>> Their argument is that because Spark handles data streaming in micro
>> batches then inevitably it introduces this in-built latency as per design.
>> In contrast, both Storm and Flink do not (at the face value) have this
>> issue.
>>
>> In addition as we already know Spark has far more capabilities compared
>> to Flink (know nothing about Storm). So really it boils down to the
>> business SLA to choose which tool one wants to deploy for your use case.
>> IMO Spark micro batching approach is probably OK for 99% of use cases. If
>> we had in built libraries for CEP for Spark (I am searching for it), I
>> would not bother with Flink.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> You probably read this benchmark at Yahoo, any comments from Spark?
>>>
>>> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>>>
>>>
>>> On 17 Apr 2016, at 12:41, andy petrella  wrote:
>>>
>>> Just adding one thing to the mix: `that the latency for streaming data
>>> is eliminated` is insane :-D
>>>
>>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
  It seems that Flink argues that the latency for streaming data is
 eliminated whereas with Spark RDD there is this latency.

 I noticed that Flink does not support interactive shell much like Spark
 shell where you can add jars to it to do kafka testing. The advice was to
 add the streaming Kafka jar file to CLASSPATH but that does not work.

 Most Flink documentation also rather sparce with the usual example of
 word count which is not exactly what you want.

 Anyway I will have a look at it further. I have a Spark Scala streaming
 Kafka program that works fine in Spark and I want to recode it using Scala
 for Flink with Kafka but have difficulty importing and testing libraries.

 Cheers

 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com



 On 17 April 2016 at 02:41, Ascot Moss  wrote:

> I compared both last month, seems to me that Flink's MLLib is not yet
> ready.
>
> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Ted. I was wondering if someone is using both :)
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 16 April 2016 at 17:08, Ted Yu  wrote:
>>
>>> Looks like this question is more relevant on flink mailing list :-)
>>>
>>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
>>> 

Re: Apache Flink

2016-04-17 Thread Koert Kuipers
I never found much info that Flink was actually designed to be fault
tolerant. If fault tolerance is more of a bolt-on/add-on/afterthought, then that
doesn't bode well for large-scale data processing. Spark was designed with
fault tolerance in mind from the beginning.

On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> I read the benchmark published by Yahoo. Obviously they already use Storm
> and inevitably very familiar with that tool. To start with although these
> benchmarks were somehow interesting IMO, it lend itself to an assurance
> that the tool chosen for their platform is still the best choice. So
> inevitably the benchmarks and the tests were done to support primary their
> approach.
>
> In general anything which is not done through TCP Council or similar body
> is questionable..
> Their argument is that because Spark handles data streaming in micro
> batches then inevitably it introduces this in-built latency as per design.
> In contrast, both Storm and Flink do not (at the face value) have this
> issue.
>
> In addition as we already know Spark has far more capabilities compared to
> Flink (know nothing about Storm). So really it boils down to the business
> SLA to choose which tool one wants to deploy for your use case. IMO Spark
> micro batching approach is probably OK for 99% of use cases. If we had in
> built libraries for CEP for Spark (I am searching for it), I would not
> bother with Flink.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> You probably read this benchmark at Yahoo, any comments from Spark?
>>
>> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>>
>>
>> On 17 Apr 2016, at 12:41, andy petrella  wrote:
>>
>> Just adding one thing to the mix: `that the latency for streaming data is
>> eliminated` is insane :-D
>>
>> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>  It seems that Flink argues that the latency for streaming data is
>>> eliminated whereas with Spark RDD there is this latency.
>>>
>>> I noticed that Flink does not support interactive shell much like Spark
>>> shell where you can add jars to it to do kafka testing. The advice was to
>>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>>>
>>> Most Flink documentation also rather sparce with the usual example of
>>> word count which is not exactly what you want.
>>>
>>> Anyway I will have a look at it further. I have a Spark Scala streaming
>>> Kafka program that works fine in Spark and I want to recode it using Scala
>>> for Flink with Kafka but have difficulty importing and testing libraries.
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 April 2016 at 02:41, Ascot Moss  wrote:
>>>
 I compared both last month, seems to me that Flink's MLLib is not yet
 ready.

 On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Thanks Ted. I was wondering if someone is using both :)
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 16 April 2016 at 17:08, Ted Yu  wrote:
>
>> Looks like this question is more relevant on flink mailing list :-)
>>
>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Has anyone used Apache Flink instead of Spark by any chance
>>>
>>> I am interested in its set of libraries for Complex Event Processing.
>>>
>>> Frankly I don't know if it offers far more than Spark offers.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>

>>> --
>> andy
>>
>>
>>
>


Re: JSON Usage

2016-04-17 Thread Hyukjin Kwon
Hi!

Personally, I don't think it necessarily needs to be a Dataset for your goal.

Just select your data at "s3" from the DataFrame loaded by
sqlContext.read.json().

You can try to printSchema() to check the nested schema and then select the
data.
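
A minimal sketch of those two steps, reusing the message body from your code
(sqlContext and message are assumed from there):

    import sqlContext.implicits._

    // Infer the nested schema from one message body and inspect it first.
    val df = sqlContext.read.json(sc.parallelize(Seq(message.getBody())))
    df.printSchema()

    // Then select only the nested fields that are actually needed.
    df.select($"Records.s3.bucket.name", $"Records.s3.object.key").show()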

Also, I guess (from your code) you are trying to send a request, fetch the
response on the driver side, and then send each message to the executor side. I
guess there would be really heavy overhead on the driver side.
Holden,

If I were to use DataSets, then I would essentially do this:

val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
for (message <- messages.asScala) {
val files = sqlContext.read.json(message.getBody())
}

Can I simply do files.toDS() or do I have to create a schema using a case
class File and apply it as[File]? If I have to apply a schema, then how
would I create it based on the JSON structure below, especially the nested
elements.

Thanks,
Ben


On Apr 14, 2016, at 3:46 PM, Holden Karau  wrote:

You could certainly use RDDs for that, you might also find using Dataset
selecting the fields you need to construct the URL to fetch and then using
the map function to be easier.

On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim  wrote:

> I was wonder what would be the best way to use JSON in Spark/Scala. I need
> to lookup values of fields in a collection of records to form a URL and
> download that file at that location. I was thinking an RDD would be perfect
> for this. I just want to hear from others who might have more experience in
> this. Below is the actual JSON structure that I am trying to use for the S3
> bucket and key values of each “record" within “Records".
>
> {
>"Records":[
>   {
>  "eventVersion":"2.0",
>  "eventSource":"aws:s3",
>  "awsRegion":"us-east-1",
>  "eventTime":The time, in ISO-8601 format, for example,
> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>  "eventName":"event-type",
>  "userIdentity":{
>
> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>  },
>  "requestParameters":{
> "sourceIPAddress":"ip-address-where-request-came-from"
>  },
>  "responseElements":{
> "x-amz-request-id":"Amazon S3 generated request ID",
> "x-amz-id-2":"Amazon S3 host that processed the request"
>  },
>  "s3":{
> "s3SchemaVersion":"1.0",
> "configurationId":"ID found in the bucket notification
> configuration",
> "bucket":{
>"name":"bucket-name",
>"ownerIdentity":{
>   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>},
>"arn":"bucket-ARN"
> },
> "object":{
>"key":"object-key",
>"size":object-size,
>"eTag":"object eTag",
>"versionId":"object version if bucket is
> versioning-enabled, otherwise null",
>"sequencer": "a string representation of a hexadecimal
> value used to determine event sequence,
>only used with PUTs and DELETEs"
> }
>  }
>   },
>   {
>   // Additional events
>   }
>]
> }
>
> Thanks
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Access to Mesos Docker Cmd for Spark Executors

2016-04-17 Thread John Omernik
Hey all,

I was wondering if there is a way to access/edit the command on Spark
Executors while using Docker on Mesos.

The reason is this: I am using the MapR file client, and the Spark driver
is trying to execute things as my user "user1". The executors, however, run
as root inside the container, and the MapR file client tries to look up
information for my user, which doesn't exist in the Docker container.

This is something I've handled on my cluster by having a user-sync script and
a Docker setup in which you can specify the user you are running as in the
command.

So, for example, when I run Zeppelin, instead of
$ZEPPELIN_HOME/bin/zeppelin.sh, in Marathon I execute

/mapr/clustername/dockersync.sh user1 && su -c $ZEPPELIN_HOME/bin/zeppelin.sh user1

Now the user I am running as, and all the groups that user is in, are
installed into the Docker container, and when I start Zeppelin it runs
as that user (with the correct UIDs and GIDs).

So with the Mesos images, I'd like to be able to alter the command and
prepend my docker-sync script to install the users so credentials etc. can be
preserved. There may be other uses as well. Since we can already access the
image, the volumes, and the port maps, allowing us to edit the command
(have the default be "what works", and if we edit it badly, it's our
own fault) would give us the freedom to do things like this... does this
capability exist?

Thanks,

John


Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Hi,

I read the benchmark published by Yahoo. Obviously they already use Storm
and are inevitably very familiar with that tool. To start with, although these
benchmarks were somewhat interesting IMO, they lend themselves to an assurance
that the tool chosen for their platform is still the best choice. So
inevitably the benchmarks and the tests were done primarily to support their
approach.

In general, anything which is not done through the TPC Council or a similar body
is questionable.
Their argument is that because Spark handles data streaming in micro-batches,
it inevitably introduces this built-in latency by design.
In contrast, both Storm and Flink do not (at face value) have this
issue.

In addition, as we already know, Spark has far more capabilities compared to
Flink (I know nothing about Storm). So really it boils down to the business
SLA to choose which tool one wants to deploy for your use case. IMO Spark's
micro-batching approach is probably OK for 99% of use cases. If we had
built-in libraries for CEP in Spark (I am searching for them), I would not
bother with Flink.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> You probably read this benchmark at Yahoo, any comments from Spark?
>
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>
>
> On 17 Apr 2016, at 12:41, andy petrella  wrote:
>
> Just adding one thing to the mix: `that the latency for streaming data is
> eliminated` is insane :-D
>
> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>  It seems that Flink argues that the latency for streaming data is
>> eliminated whereas with Spark RDD there is this latency.
>>
>> I noticed that Flink does not support interactive shell much like Spark
>> shell where you can add jars to it to do kafka testing. The advice was to
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>>
>> Most Flink documentation also rather sparce with the usual example of
>> word count which is not exactly what you want.
>>
>> Anyway I will have a look at it further. I have a Spark Scala streaming
>> Kafka program that works fine in Spark and I want to recode it using Scala
>> for Flink with Kafka but have difficulty importing and testing libraries.
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 02:41, Ascot Moss  wrote:
>>
>>> I compared both last month, seems to me that Flink's MLLib is not yet
>>> ready.
>>>
>>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks Ted. I was wondering if someone is using both :)

 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com



 On 16 April 2016 at 17:08, Ted Yu  wrote:

> Looks like this question is more relevant on flink mailing list :-)
>
> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Has anyone used Apache Flink instead of Spark by any chance
>>
>> I am interested in its set of libraries for Complex Event Processing.
>>
>> Frankly I don't know if it offers far more than Spark offers.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>

>>>
>> --
> andy
>
>
>


RE: Apache Flink

2016-04-17 Thread Silvio Fiorito
Actually there were multiple responses to it on the GitHub project, including a 
PR to improve the Spark code, but they weren’t acknowledged.


From: Ovidiu-Cristian MARCU
Sent: Sunday, April 17, 2016 7:48 AM
To: andy petrella
Cc: Mich Talebzadeh; Ascot 
Moss; Ted Yu; user 
@spark
Subject: Re: Apache Flink

You probably read this benchmark at Yahoo, any comments from Spark?
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at


On 17 Apr 2016, at 12:41, andy petrella 
> wrote:

Just adding one thing to the mix: `that the latency for streaming data is 
eliminated` is insane :-D

On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh 
> wrote:
 It seems that Flink argues that the latency for streaming data is eliminated 
whereas with Spark RDD there is this latency.

I noticed that Flink does not support interactive shell much like Spark shell 
where you can add jars to it to do kafka testing. The advice was to add the 
streaming Kafka jar file to CLASSPATH but that does not work.

Most Flink documentation also rather sparce with the usual example of word 
count which is not exactly what you want.

Anyway I will have a look at it further. I have a Spark Scala streaming Kafka 
program that works fine in Spark and I want to recode it using Scala for Flink 
with Kafka but have difficulty importing and testing libraries.

Cheers

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 April 2016 at 02:41, Ascot Moss 
> wrote:
I compared both last month, seems to me that Flink's MLLib is not yet ready.

On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh 
> wrote:
Thanks Ted. I was wondering if someone is using both :)

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 April 2016 at 17:08, Ted Yu 
> wrote:
Looks like this question is more relevant on flink mailing list :-)

On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh 
> wrote:
Hi,

Has anyone used Apache Flink instead of Spark by any chance

I am interested in its set of libraries for Complex Event Processing.

Frankly I don't know if it offers far more than Spark offers.

Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com






--
andy



Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Mich Talebzadeh
Hi Jorn,

Sure will do.

What the Oracle in-memory offering does is allow the user to store a *copy* of
selected tables, or partitions, in *columnar* format in-memory within the
Oracle Database memory space. All tables are still present in row format
and all copies on storage are in row format. These columnar copies are not
logged nor are they ever persisted to disk.   The Oracle Database optimizer
is aware of the presence and currency of the in-memory copies and
transparently uses them for any analytical style queries that can benefit
from the vastly faster processing speed. This is all completely transparent
to applications.



The primary use case for this capability is to accelerate the analytics
part of mixed OLTP and analytical workloads by eliminating the need for
most of the analytics indexes that are typically found in a database that
supports such a mixed workload. Not only does this speed up the analytical
queries by a huge amount (often 100x or more), but the ability to drop many
of the analytical indexes also has a major benefit for OLTP performance.


Now with regard to the Hive database, I am not aware of such a mixed-workload
use case. Additionally, one might argue that a better indexing strategy would
benefit Hive database performance more than an in-memory offering or SSD.


I guess I will be doing some investigation on that front as well.


HTH


P.S. Do we have any idea what the largest Hive database (schema) is in
terms of size? Any published results?




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 11:29, Jörn Franke  wrote:

>
> You could also explore the in-memory database of 12c . However, I am not
> sure how beneficial it is for Oltp scenarios.
>
> I am excited to see how the performance will be on hbase as a hive
> metastore.
>
> Nevertheless, your results on Oracle/SSD will be beneficial for the
> community.
>
> On 17 Apr 2016, at 11:52, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I have had my Hive metastore database on Oracle 11g supporting concurrency
> (with added transactional capability)
>
> Over the past few days I created a new schema on Oracle 12c on Solid State
> Disks (SSD) and used databump (exdp, imdp) to migrate Hive database from
> Oracle 11g to Oracle 12c on SSD.
>
> Couple of years ago I did some work for OLTP operations (many random
> access via index scan plus serial scans) for Oracle 11g and SAP ASE 15.7. I
> noticed that for Random Access with Index scans the performance improves by
> a factor of 20  because of much faster seek time for SSD
>
> https://www.scribd.com/doc/119707722/IOUG-SELECT-Q312-Final
>
> I have recently seen some contention for access resources in Hive
> database, so I think going to SSD will improve the performance of Hive in
> general.
>
> I will look at AWR reports to see how beneficial this set up is as this
> Oracle instance is more and less dedicated to Hive.
>
> HTH
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: Apache Flink

2016-04-17 Thread Igor Berman
Latency in Flink is not eliminated, but it might be lower, since Flink
processes each event one by one while Spark does micro-batching (so you
cannot achieve latency lower than your micro-batch interval).
Spark will probably have better throughput due to this micro-batching.
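
To make the micro-batching point concrete, here is a minimal Spark Streaming sketch in Scala
(DStream API); the two-second batch interval and the socket source are arbitrary choices for
illustration, not a recommendation:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchLatencySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-latency-sketch")

    // Every DStream transformation runs once per batch interval (2 seconds here),
    // so a record arriving just after a batch boundary waits up to ~2 seconds
    // before processing even starts: the batch interval is the latency floor.
    val ssc = new StreamingContext(conf, Seconds(2))

    // A socket source keeps the example self-contained; any receiver would do.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Per-batch processing: end-to-end latency >= batch interval + processing time.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}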



On 17 April 2016 at 14:47, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> You probably read this benchmark at Yahoo, any comments from Spark?
>
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
>
>
> On 17 Apr 2016, at 12:41, andy petrella  wrote:
>
> Just adding one thing to the mix: `that the latency for streaming data is
> eliminated` is insane :-D
>
> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>  It seems that Flink argues that the latency for streaming data is
>> eliminated whereas with Spark RDD there is this latency.
>>
>> I noticed that Flink does not support interactive shell much like Spark
>> shell where you can add jars to it to do kafka testing. The advice was to
>> add the streaming Kafka jar file to CLASSPATH but that does not work.
>>
>> Most Flink documentation also rather sparce with the usual example of
>> word count which is not exactly what you want.
>>
>> Anyway I will have a look at it further. I have a Spark Scala streaming
>> Kafka program that works fine in Spark and I want to recode it using Scala
>> for Flink with Kafka but have difficulty importing and testing libraries.
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 April 2016 at 02:41, Ascot Moss  wrote:
>>
>>> I compared both last month, seems to me that Flink's MLLib is not yet
>>> ready.
>>>
>>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks Ted. I was wondering if someone is using both :)

 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com



 On 16 April 2016 at 17:08, Ted Yu  wrote:

> Looks like this question is more relevant on flink mailing list :-)
>
> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Has anyone used Apache Flink instead of Spark by any chance
>>
>> I am interested in its set of libraries for Complex Event Processing.
>>
>> Frankly I don't know if it offers far more than Spark offers.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>

>>>
>> --
> andy
>
>
>


Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
You have probably read this benchmark at Yahoo; any comments from the Spark side?
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
 



> On 17 Apr 2016, at 12:41, andy petrella  wrote:
> 
> Just adding one thing to the mix: `that the latency for streaming data is 
> eliminated` is insane :-D
> 
> On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh  > wrote:
>  It seems that Flink argues that the latency for streaming data is eliminated 
> whereas with Spark RDD there is this latency.
> 
> I noticed that Flink does not support interactive shell much like Spark shell 
> where you can add jars to it to do kafka testing. The advice was to add the 
> streaming Kafka jar file to CLASSPATH but that does not work.
> 
> Most Flink documentation also rather sparce with the usual example of word 
> count which is not exactly what you want.
> 
> Anyway I will have a look at it further. I have a Spark Scala streaming Kafka 
> program that works fine in Spark and I want to recode it using Scala for 
> Flink with Kafka but have difficulty importing and testing libraries.
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 17 April 2016 at 02:41, Ascot Moss  > wrote:
> I compared both last month, seems to me that Flink's MLLib is not yet ready.
> 
> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh  > wrote:
> Thanks Ted. I was wondering if someone is using both :)
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 16 April 2016 at 17:08, Ted Yu  > wrote:
> Looks like this question is more relevant on flink mailing list :-)
> 
> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh  > wrote:
> Hi,
> 
> Has anyone used Apache Flink instead of Spark by any chance
> 
> I am interested in its set of libraries for Complex Event Processing.
> 
> Frankly I don't know if it offers far more than Spark offers.
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> 
> 
> 
> -- 
> andy



Re: Apache Flink

2016-04-17 Thread andy petrella
Just adding one thing to the mix: `that the latency for streaming data is
eliminated` is insane :-D

On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh 
wrote:

>  It seems that Flink argues that the latency for streaming data is
> eliminated whereas with Spark RDD there is this latency.
>
> I noticed that Flink does not support interactive shell much like Spark
> shell where you can add jars to it to do kafka testing. The advice was to
> add the streaming Kafka jar file to CLASSPATH but that does not work.
>
> Most Flink documentation also rather sparce with the usual example of word
> count which is not exactly what you want.
>
> Anyway I will have a look at it further. I have a Spark Scala streaming
> Kafka program that works fine in Spark and I want to recode it using Scala
> for Flink with Kafka but have difficulty importing and testing libraries.
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 April 2016 at 02:41, Ascot Moss  wrote:
>
>> I compared both last month, seems to me that Flink's MLLib is not yet
>> ready.
>>
>> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Ted. I was wondering if someone is using both :)
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 16 April 2016 at 17:08, Ted Yu  wrote:
>>>
 Looks like this question is more relevant on flink mailing list :-)

 On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> Has anyone used Apache Flink instead of Spark by any chance
>
> I am interested in its set of libraries for Complex Event Processing.
>
> Frankly I don't know if it offers far more than Spark offers.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


>>>
>>
> --
andy


Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke

You could also explore the in-memory database of 12c. However, I am not sure
how beneficial it is for OLTP scenarios.

I am excited to see how the performance will be with HBase as a Hive metastore.

Nevertheless, your results on Oracle/SSD will be beneficial for the community.

> On 17 Apr 2016, at 11:52, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I have had my Hive metastore database on Oracle 11g supporting concurrency 
> (with added transactional capability)
> 
> Over the past few days I created a new schema on Oracle 12c on Solid State 
> Disks (SSD) and used databump (exdp, imdp) to migrate Hive database from 
> Oracle 11g to Oracle 12c on SSD.
> 
> Couple of years ago I did some work for OLTP operations (many random access 
> via index scan plus serial scans) for Oracle 11g and SAP ASE 15.7. I noticed 
> that for Random Access with Index scans the performance improves by a factor 
> of 20  because of much faster seek time for SSD
> 
> https://www.scribd.com/doc/119707722/IOUG-SELECT-Q312-Final
> 
> I have recently seen some contention for access resources in Hive database, 
> so I think going to SSD will improve the performance of Hive in general.
> 
> I will look at AWR reports to see how beneficial this set up is as this 
> Oracle instance is more and less dedicated to Hive.
> 
> HTH
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  


Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
It seems that Flink argues that the latency for streaming data is
eliminated, whereas with Spark RDDs there is this latency.

I noticed that Flink does not support an interactive shell much like the Spark
shell, where you can add jars to it to do Kafka testing. The advice was to
add the streaming Kafka jar file to the CLASSPATH, but that does not work.

Most of the Flink documentation is also rather sparse, with the usual word-count
example, which is not exactly what you want.

Anyway, I will have a look at it further. I have a Spark Scala streaming
Kafka program that works fine in Spark and I want to recode it in Scala
for Flink with Kafka, but I have difficulty importing and testing the libraries.
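
For reference, a minimal sketch of the kind of Spark Scala streaming Kafka consumer described
above, assuming Spark 1.x with the spark-streaming-kafka artifact and its direct-stream API;
the broker address, topic name, and word-count logic are placeholders rather than the actual
program being discussed:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-word-count-sketch")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // Placeholder broker list and topic; substitute the real ones.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics      = Set("test-topic")

    // Receiver-less "direct" stream of (key, value) pairs from Kafka.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Simple per-batch word count over the message values.
    messages.map(_._2)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}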

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 April 2016 at 02:41, Ascot Moss  wrote:

> I compared both last month, seems to me that Flink's MLLib is not yet
> ready.
>
> On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Ted. I was wondering if someone is using both :)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 16 April 2016 at 17:08, Ted Yu  wrote:
>>
>>> Looks like this question is more relevant on flink mailing list :-)
>>>
>>> On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Has anyone used Apache Flink instead of Spark by any chance

 I am interested in its set of libraries for Complex Event Processing.

 Frankly I don't know if it offers far more than Spark offers.

 Thanks

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



>>>
>>>
>>
>


Moving Hive metastore to Solid State Disks

2016-04-17 Thread Mich Talebzadeh
Hi,

I have had my Hive metastore database on Oracle 11g supporting concurrency
(with added transactional capability).

Over the past few days I created a new schema on Oracle 12c on Solid State
Disks (SSD) and used Data Pump (expdp, impdp) to migrate the Hive database from
Oracle 11g to Oracle 12c on SSD.

A couple of years ago I did some work on OLTP operations (many random accesses
via index scans plus serial scans) for Oracle 11g and SAP ASE 15.7. I
noticed that for random access with index scans the performance improves by
a factor of 20 because of the much faster seek time of SSD:

https://www.scribd.com/doc/119707722/IOUG-SELECT-Q312-Final

I have recently seen some contention for access to resources in the Hive database,
so I think going to SSD will improve the performance of Hive in general.

I will look at AWR reports to see how beneficial this setup is, as this
Oracle instance is more or less dedicated to Hive.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Spark support for Complex Event Processing (CEP)

2016-04-17 Thread Mich Talebzadeh
Hi,

Has Spark got libraries for CEP using Spark Streaming with Kafka by any
chance?

I am looking at Flink, which is supposed to have these libraries for CEP, but I
find Flink itself very much a work in progress.
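
Spark does not ship a dedicated CEP library comparable to Flink's, but simple patterns can be
approximated with DStream window operations. Below is a minimal sketch in Scala that flags
users with more than three failed logins within a minute; the event format, source, threshold,
and window sizes are all assumptions made for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FailedLoginPatternSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("failed-login-pattern-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Assumed input: one "user,eventType" line per event; a socket source
    // stands in for Kafka to keep the sketch self-contained.
    val events = ssc.socketTextStream("localhost", 9999)

    // Count LOGIN_FAILED events per user over a 60-second window that
    // slides every 10 seconds (both multiples of the batch interval).
    val failuresPerUser = events
      .map(_.split(","))
      .filter(fields => fields.length == 2 && fields(1) == "LOGIN_FAILED")
      .map(fields => (fields(0), 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

    // "Pattern match": emit users with more than three failures in the window.
    failuresPerUser.filter(_._2 > 3).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

A real CEP engine adds event ordering, richer pattern operators, and timeouts; this only shows
how far plain window operations get you.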

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com