Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Akhilesh Pathodia
Hi Renato,

Which version of Spark are you using?

If your Spark version is 1.3.0 or later, you can use SQLContext to read the
parquet file, which will give you a DataFrame. Please follow the link below:

https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically
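
For example, a minimal sketch of that approach (the path and the printSchema/show
calls are just illustrative; sqlContext.read is the Spark 1.4+ API, earlier
versions use sqlContext.parquetFile):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// load the parquet file(s) into a DataFrame
val df = sqlContext.read.parquet("data/orders.parquet")
df.printSchema()
df.show()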

Thanks,
Akhilesh

On Sat, Aug 27, 2016 at 3:26 AM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Anybody? I think Rory also didn't get an answer from the list ...
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3CCAC+
> fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E
>
>
>
> 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com>:
>
>> Hi all,
>>
>> I am trying to use parquet files as input for DStream operations, but I
>> can't find any documentation or example. The only thing I found was [1] but
>> I also get the same error as in the post (Class
>> parquet.avro.AvroReadSupport not found).
>> Ideally I would like to have something like this:
>>
>> val oDStream = ssc.fileStream[Void, Order, ParquetInputFormat[Order]]("data/")
>>
>> where Order is a case class and the files inside "data" are all parquet
>> files.
>> Any hints would be highly appreciated. Thanks!
>>
>>
>> Best,
>>
>> Renato M.
>>
>> [1] http://stackoverflow.com/questions/35413552/how-do-i-read-in-parquet-files-using-ssc-filestream-and-what-is-the-nature
>>
>
>


Re: Please assist: Building Docker image containing spark 2.0

2016-08-27 Thread Marco Mistroni
Thanks, I'll follow the advice and try again

kr
 marco

On Sat, Aug 27, 2016 at 4:04 AM, Mike Metzger 
wrote:

> I would also suggest building the container manually first and setting up
> everything you specifically need.  Once done, you can then grab the history
> file, pull out the invalid commands and build out the completed
> Dockerfile.  Trying to troubleshoot an installation via Dockerfile is often
> an exercise in futility.
>
> Thanks
>
> Mike
>
>
> On Fri, Aug 26, 2016 at 5:14 PM, Michael Gummelt 
> wrote:
>
>> Run with "-X -e" like the error message says. See what comes out.
>>
>> On Fri, Aug 26, 2016 at 2:23 PM, Tal Grynbaum 
>> wrote:
>>
>>> Did you specify -Dscala-2.10?
>>> As in:
>>> ./dev/change-scala-version.sh 2.10
>>> ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
>>> If you're building with Scala 2.10.
>>>
>>> On Sat, Aug 27, 2016, 00:18 Marco Mistroni  wrote:
>>>
 Hello Michael
 uhm i celebrated too soon
 Compilation of spark on docker image went near the end and then it
 errored out with this message

 INFO] BUILD FAILURE
 [INFO] 
 
 [INFO] Total time: 01:01 h
 [INFO] Finished at: 2016-08-26T21:12:25+00:00
 [INFO] Final Memory: 69M/324M
 [INFO] 
 
 [ERROR] Failed to execute goal 
 net.alchim31.maven:scala-maven-plugin:3.2.2:compile
 (scala-compile-first) on project spark-mllib_2.11: Execution
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.2.2:compile
 failed. CompileFailed -> [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with
 the -e switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions,
 please read the following articles:
 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
 [ERROR]
 [ERROR] After correcting the problems, you can resume the build with
 the command
 [ERROR]   mvn  -rf :spark-mllib_2.11
 The command '/bin/sh -c ./build/mvn -Pyarn -Phadoop-2.4
 -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero code:
 1

 what am i forgetting?
 once again, last command i launched on the docker file is


 RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
 clean package

 kr



 On Fri, Aug 26, 2016 at 6:18 PM, Michael Gummelt <
 mgumm...@mesosphere.io> wrote:

> :)
>
> On Thu, Aug 25, 2016 at 2:29 PM, Marco Mistroni 
> wrote:
>
>> No i wont accept that :)
>> I can't believe i have wasted 3 hrs for a space!
>>
>> Many thanks MIchael!
>>
>> kr
>>
>> On Thu, Aug 25, 2016 at 10:01 PM, Michael Gummelt <
>> mgumm...@mesosphere.io> wrote:
>>
>>> You have a space between "build" and "mvn"
>>>
>>> On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni >> > wrote:
>>>
 HI all
  sorry for being partially off-topic; I hope there's someone on the
 list who has tried the same and encountered similar issues

 Ok so I have created a Dockerfile to build an Ubuntu container
 which includes Spark 2.0, but somehow when it gets to the point where it
 has to kick off the ./build/mvn command, it errors out with the following

 ---> Running in 8c2aa6d59842
 /bin/sh: 1: ./build: Permission denied
 The command '/bin/sh -c ./build mvn -Pyarn -Phadoop-2.4
 -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero 
 code:
 126

 I am puzzled as I am root when I build the container, so I should
 not encounter this issue (btw, if instead of running mvn from the build
 directory I use the mvn which I installed on the container, it works fine
 but it's painfully slow)

 here are the details of my Spark command (Scala 2.10, Java 1.7,
 mvn 3.3.9 and git have already been installed)

 # Spark
 RUN echo "Installing Apache spark 2.0"
 RUN git clone git://github.com/apache/spark.git
 WORKDIR /spark
 RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
 -DskipTests clean package


 Could anyone assist pls?

 kindest regards
  Marco


>>>
>>>
>>> --
>>> Michael Gummelt
>>> Software Engineer
>>> Mesosphere
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>


Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread Mich Talebzadeh
Are we actually looking for a real-time dashboard of some sort for the Spark UI
interface?

After all, one would think a real-time dashboard could do this!

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 August 2016 at 23:38, Mich Talebzadeh 
wrote:

> Thanks Jacek,
>
> I will have a look. I think it is long overdue.
>
> I mean we try to micro batch and stream everything below seconds but when
> it comes to help  monitor basics we are still miles behind :(
>
> Cheers,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 23:21, Jacek Laskowski  wrote:
>
>> Hi Mich,
>>
>> I don't think so. There is support for a UI page refresh but I haven't
>> seen it in use.
>>
>> See StreamingPage [1] where it schedules refresh every 5 secs, i.e.
>> Some(5000). In SparkUIUtils.headerSparkPage [2] there is
>> refreshInterval but it's not used in any place in Spark.
>>
>> Time to file a JIRA issue?
>>
>> What about REST API and httpie updating regularly [3]? Perhaps Metrics
>> with ConsoleSink [4]?
>>
>> [1] https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala#L158
>> [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L202
>> [3] http://spark.apache.org/docs/latest/monitoring.html#rest-api
>> [4] http://spark.apache.org/docs/latest/monitoring.html#metrics
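>>
>> For example, a rough sketch of [3] and [4] (4040 is the default spark.ui.port;
>> the 5-second and 10-second intervals are just assumptions):
>>
>> # [3] poll the REST API every 5 seconds with httpie
>> watch -n 5 http GET http://localhost:4040/api/v1/applications
>>
>> # [4] conf/metrics.properties entries enabling ConsoleSink
>> *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
>> *.sink.console.period=10
>> *.sink.console.unit=seconds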
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Thu, Aug 25, 2016 at 11:55 AM, Mich Talebzadeh
>>  wrote:
>> > Hi,
>> >
>> > This may be already there.
>> >
>> > A spark job opens up a UI on port specified by --conf
>> "spark.ui.port=${SP}"
>> > that defaults to 4040.
>> >
>> > However, on UI one needs to refresh the page to see the progress.
>> >
>> > Can this be polled so it is refreshed automatically
>> >
>> > Thanks
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJ
>> d6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> arise
>> > from relying on this email's technical content is explicitly
>> disclaimed. The
>> > author will in no case be liable for any monetary damages arising from
>> such
>> > loss, damage or destruction.
>> >
>> >
>>
>
>


Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread Mich Talebzadeh
Thanks Sivakumaran

I don't think we can use Zeppelin for this purpose. It is not a real-time
dashboard, nor can it be. I use it, but much like Tableau with added Scala
programming.

Does anyone know of open source real time dashboards?


Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 August 2016 at 09:42, Sivakumaran S  wrote:

> I would love to participate in developing a dashboard of some sort in lieu
> of (or at least to complement) the Spark UI.
>
> Regards,
>
> Sivakumaran S
>
> On 27 Aug 2016 9:34 a.m., Mich Talebzadeh 
> wrote:
>
> Are we actually looking for a real-time dashboard of some sort for the Spark
> UI interface?
>
> After all, one would think a real-time dashboard could do this!
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 23:38, Mich Talebzadeh 
> wrote:
>
> Thanks Jacek,
>
> I will have a look. I think it is long overdue.
>
> I mean we try to micro batch and stream everything below seconds but when
> it comes to help  monitor basics we are still miles behind :(
>
> Cheers,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 23:21, Jacek Laskowski  wrote:
>
> Hi Mich,
>
> I don't think so. There is support for a UI page refresh but I haven't
> seen it in use.
>
> See StreamingPage [1] where it schedules refresh every 5 secs, i.e.
> Some(5000). In SparkUIUtils.headerSparkPage [2] there is
> refreshInterval but it's not used in any place in Spark.
>
> Time to file a JIRA issue?
>
> What about REST API and httpie updating regularly [3]? Perhaps Metrics
> with ConsoleSink [4]?
>
> [1] https://github.com/apache/spark/blob/master/streaming/src/ma
> in/scala/org/apache/spark/streaming/ui/StreamingPage.scala#L158
> [2] https://github.com/apache/spark/blob/master/core/src/main/sc
> ala/org/apache/spark/ui/UIUtils.scala#L202
> [3] http://spark.apache.org/docs/latest/monitoring.html#rest-api
> [4] http://spark.apache.org/docs/latest/monitoring.html#metrics
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Aug 25, 2016 at 11:55 AM, Mich Talebzadeh
>  wrote:
> > Hi,
> >
> > This may be already there.
> >
> > A spark job opens up a UI on port specified by --conf
> "spark.ui.port=${SP}"
> > that defaults to 4040.
> >
> > However, on UI one needs to refresh the page to see the progress.
> >
> > Can this be polled so it is refreshed automatically
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJ
> d6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
>
>
>
>
>


Issues with Spark On Hbase Connector and versions

2016-08-27 Thread spats
Regarding the HBase connector by Hortonworks
(https://github.com/hortonworks-spark/shc), it would be great if someone could
answer these:

1. What versions of HBase & Spark are expected? I could not run the examples
provided using Spark 1.6.0 & HBase 1.2.0.
2. I get an error when I run the example provided here; any pointers on what I
am doing wrong?

It looks like Spark is not reading hbase-site.xml, even though I passed it via
--files when launching spark-shell, e.g.:
--files
/etc/hbase/conf/hbase-site.xml,/etc/hbase/conf/hdfs-site.xml,/etc/hbase/conf/core-site.xml

error 
16/08/27 12:35:00 WARN zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)






Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread Jacek Laskowski
Hi,

There's no better way to start a project than...github it :-) Create a new
project, clone it and do dzieła! (= go ahead in Polish).

Jacek

On 27 Aug 2016 10:42 a.m., "Sivakumaran S"  wrote:

> I would love to participate in developing a dashboard of some sort in lieu
> of (or at least to complement) the Spark UI.
>
> Regards,
>
> Sivakumaran S
>
> On 27 Aug 2016 9:34 a.m., Mich Talebzadeh 
> wrote:
>
> Are we actually looking for a real-time dashboard of some sort for the Spark
> UI interface?
>
> After all, one would think a real-time dashboard could do this!
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 23:38, Mich Talebzadeh 
> wrote:
>
> Thanks Jacek,
>
> I will have a look. I think it is long overdue.
>
> I mean we try to micro batch and stream everything below seconds but when
> it comes to help  monitor basics we are still miles behind :(
>
> Cheers,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 23:21, Jacek Laskowski  wrote:
>
> Hi Mich,
>
> I don't think so. There is support for a UI page refresh but I haven't
> seen it in use.
>
> See StreamingPage [1] where it schedules refresh every 5 secs, i.e.
> Some(5000). In SparkUIUtils.headerSparkPage [2] there is
> refreshInterval but it's not used in any place in Spark.
>
> Time to file a JIRA issue?
>
> What about REST API and httpie updating regularly [3]? Perhaps Metrics
> with ConsoleSink [4]?
>
> [1] https://github.com/apache/spark/blob/master/streaming/src/ma
> in/scala/org/apache/spark/streaming/ui/StreamingPage.scala#L158
> [2] https://github.com/apache/spark/blob/master/core/src/main/sc
> ala/org/apache/spark/ui/UIUtils.scala#L202
> [3] http://spark.apache.org/docs/latest/monitoring.html#rest-api
> [4] http://spark.apache.org/docs/latest/monitoring.html#metrics
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Aug 25, 2016 at 11:55 AM, Mich Talebzadeh
>  wrote:
> > Hi,
> >
> > This may be already there.
> >
> > A spark job opens up a UI on port specified by --conf
> "spark.ui.port=${SP}"
> > that defaults to 4040.
> >
> > However, on UI one needs to refresh the page to see the progress.
> >
> > Can this be polled so it is refreshed automatically
> >
> > Thanks
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJ
> d6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
>
>
>
>
>


Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread Mich Talebzadeh
Hi All,

GitHub project SparkUIDashboard created here





[image: Inline images 2]
Let us put the show on the road :)

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 August 2016 at 12:53, Sivakumaran S  wrote:

> Hi Mich,
>
> Unlikely that we can use Zeppelin for dynamic, real time update
> visualisation. It makes nice, static visuals.
>
> I was thinking more on the lines of http://dashingdemo.
> herokuapp.com/sample
>
> The library is http://dashing.io
>
> There are more widgets that can be used https://github.com/
> Shopify/dashing/wiki/Additional-Widgets
>
> The Spark UI is functional, but I am looking forward to some aesthetics
> and high level picture of the process. Using Websockets, the dashboard can
> be updated real time without the need of refreshing the page.
>
> Regards,
>
> Sivakumaran S
>
>
> On 27-Aug-2016, at 10:10 AM, Mich Talebzadeh 
> wrote:
>
> Thanks Sivakumaran
>
> I don't think we can use Zeppelin for this purpose. It is not a real-time
> dashboard, nor can it be. I use it, but much like Tableau with added Scala
> programming.
>
> Does anyone know of open source real time dashboards?
>
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 August 2016 at 09:42, Sivakumaran S  wrote:
>
>> I would love to participate in developing a dashboard of some sort in lieu
>> of (or at least to complement) the Spark UI.
>>
>> Regards,
>>
>> Sivakumaran S
>>
>> On 27 Aug 2016 9:34 a.m., Mich Talebzadeh 
>> wrote:
>>
>> Are we actually looking for a real-time dashboard of some sort for the
>> Spark UI interface?
>>
>> After all, one would think a real-time dashboard could do this!
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 August 2016 at 23:38, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Jacek,
>>
>> I will have a look. I think it is long overdue.
>>
>> I mean we try to micro batch and stream everything below seconds but when
>> it comes to help  monitor basics we are still miles behind :(
>>
>> Cheers,
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 August 2016 at 23:21, Jacek Laskowski  wrote:
>>
>> Hi Mich,
>>
>> I don't think so. There is support for a UI page refresh but I haven't
>> seen it in use.
>>
>> See StreamingPage [1] where it schedules refresh every 5 secs, i.e.
>> Some(5000). In SparkUIUtils.headerSparkPage [2] there is
>> refreshInterval but it's not used in any place in Spark.
>>
>> Time to file a JIRA issue?
>>
>> What about REST API and httpie updating regularly [3]? Perhaps Metrics
>> with ConsoleSink [4]?
>>
>> [1] https://github.com/apache/spark/blob/master/streaming/src/ma
>> in/scala/org/apache/spark/streaming/ui/StreamingPage.scala#L158
>> [2] https://github.com/apache/

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Renato Marroquín Mogrovejo
Hi Akhilesh,

Thanks for your response.
I am using Spark 1.6.1 and what I am trying to do is to ingest parquet
files into Spark Streaming, not in batch operations.

val ssc = new StreamingContext(sc, Seconds(5))
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class",
"parquet.avro.AvroReadSupport")

val sqlContext = new SQLContext(sc)

import sqlContext.implicits._

val oDStream = ssc.fileStream[Void, Order,
ParquetInputFormat]("TempData/origin/")

oDStream.foreachRDD(relation => {
  if (relation.count() == 0)
println("Nothing received")
  else {
val rDF = relation.toDF().as[Order]
println(rDF.first())
  }
})

But that doesn't work. Any ideas?


Best,

Renato M.

2016-08-27 9:01 GMT+02:00 Akhilesh Pathodia :

> Hi Renato,
>
> Which version of Spark are you using?
>
> If spark version is 1.3.0 or more then you can use SqlContext to read the
> parquet file which will give you DataFrame. Please follow the below link:
>
> https://spark.apache.org/docs/1.5.0/sql-programming-guide.
> html#loading-data-programmatically
>
> Thanks,
> Akhilesh
>
> On Sat, Aug 27, 2016 at 3:26 AM, Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com> wrote:
>
>> Anybody? I think Rory also didn't get an answer from the list ...
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201602.
>> mbox/%3CCAC+fRE14PV5nvQHTBVqDC+6DkXo73oDAzfqsLbSo8F94ozO5nQ@
>> mail.gmail.com%3E
>>
>>
>>
>> 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com>:
>>
>>> Hi all,
>>>
>>> I am trying to use parquet files as input for DStream operations, but I
>>> can't find any documentation or example. The only thing I found was [1] but
>>> I also get the same error as in the post (Class
>>> parquet.avro.AvroReadSupport not found).
>>> Ideally I would like to have something like this:
>>>
>>> val oDStream = ssc.fileStream[Void, Order, ParquetInputFormat[Order]]("data/")
>>>
>>> where Order is a case class and the files inside "data" are all parquet
>>> files.
>>> Any hints would be highly appreciated. Thanks!
>>>
>>>
>>> Best,
>>>
>>> Renato M.
>>>
>>> [1] http://stackoverflow.com/questions/35413552/how-do-i-read-in
>>> -parquet-files-using-ssc-filestream-and-what-is-the-nature
>>>
>>
>>
>


Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread nguyen duc Tuan
The simplest solution that I found: use a browser extension which does that for
you :D. For example, if you are using Chrome, you can use this extension:
https://chrome.google.com/webstore/detail/easy-auto-refresh/aabcgdmkeabbnleenpncegpcngjpnjkc/related?hl=en
Another way, a bit more manual, is to use JavaScript: starting from a parent
window, you create a child window with your target url; the parent window then
refreshes that child window for you. Due to the same-origin policy, you should
set the parent url to the same url as your target url. Try this in your web
console:
// open the target url in a child window and reload it every 2000 ms (2 seconds)
var wi = window.open("your target url")
var timeInMillis = 2000
setInterval(function(){ wi.location.reload(); }, timeInMillis)
Hope this helps.

2016-08-27 20:17 GMT+07:00 Mich Talebzadeh :

> Hi All,
>
> GitHub project SparkUIDashboard created here
> 
>
>
>
>
> [image: Inline images 2]
> Let us put the show on the road :)
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 August 2016 at 12:53, Sivakumaran S  wrote:
>
>> Hi Mich,
>>
>> Unlikely that we can use Zeppelin for dynamic, real time update
>> visualisation. It makes nice, static visuals.
>>
>> I was thinking more on the lines of http://dashingdemo.herokuap
>> p.com/sample
>>
>> The library is http://dashing.io
>>
>> There are more widgets that can be used https://github.com/Shopif
>> y/dashing/wiki/Additional-Widgets
>>
>> The Spark UI is functional, but I am looking forward to some aesthetics
>> and high level picture of the process. Using Websockets, the dashboard can
>> be updated real time without the need of refreshing the page.
>>
>> Regards,
>>
>> Sivakumaran S
>>
>>
>> On 27-Aug-2016, at 10:10 AM, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Sivakumaran
>>
>> I don't think we can use Zeppelin for this purpose. It is not a real-time
>> dashboard, nor can it be. I use it, but much like Tableau with added Scala
>> programming.
>>
>> Does anyone know of open source real time dashboards?
>>
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 August 2016 at 09:42, Sivakumaran S 
>> wrote:
>>
>>> I would love to participate in developing a dashboard of some sort in lieu
>>> of (or at least to complement) the Spark UI.
>>>
>>> Regards,
>>>
>>> Sivakumaran S
>>>
>>> On 27 Aug 2016 9:34 a.m., Mich Talebzadeh 
>>> wrote:
>>>
>>> Are we actually looking for a real-time dashboard of some sort for the
>>> Spark UI interface?
>>>
>>> After all, one would think a real-time dashboard could do this!
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 26 August 2016 at 23:38, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks Jacek,
>>>
>>> I will have a look. I think it is long overdue.
>>>
>>> I mean we try to micro batch and stream everything below seconds but
>>> when it comes to help  monitor basics we are still miles behind :(
>>>
>>> Cheers,
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage

Re: Please assist: Building Docker image containing spark 2.0

2016-08-27 Thread Marco Mistroni
All good, Tal's suggestion did it. I should have read the manual first :(
Thanks for the assistance.

On Sat, Aug 27, 2016 at 9:06 AM, Marco Mistroni  wrote:

> Thanks, i'll follow advice and try again
>
> kr
>  marco
>
> On Sat, Aug 27, 2016 at 4:04 AM, Mike Metzger 
> wrote:
>
>> I would also suggest building the container manually first and setup
>> everything you specifically need.  Once done, you can then grab the history
>> file, pull out the invalid commands and build out the completed
>> Dockerfile.  Trying to troubleshoot an installation via Dockerfile is often
>> an exercise in futility.
>>
>> Thanks
>>
>> Mike
>>
>>
>> On Fri, Aug 26, 2016 at 5:14 PM, Michael Gummelt 
>> wrote:
>>
>>> Run with "-X -e" like the error message says. See what comes out.
>>>
>>> On Fri, Aug 26, 2016 at 2:23 PM, Tal Grynbaum 
>>> wrote:
>>>
 Did you specify -Dscala-2.10?
 As in:
 ./dev/change-scala-version.sh 2.10
 ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
 If you're building with Scala 2.10.

 On Sat, Aug 27, 2016, 00:18 Marco Mistroni  wrote:

> Hello Michael
> uhm i celebrated too soon
> Compilation of spark on docker image went near the end and then it
> errored out with this message
>
> INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 01:01 h
> [INFO] Finished at: 2016-08-26T21:12:25+00:00
> [INFO] Final Memory: 69M/324M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-mllib_2.11: Execution
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with
> the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with
> the command
> [ERROR]   mvn  -rf :spark-mllib_2.11
> The command '/bin/sh -c ./build/mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero 
> code:
> 1
>
> what am i forgetting?
> once again, last command i launched on the docker file is
>
>
> RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
> clean package
>
> kr
>
>
>
> On Fri, Aug 26, 2016 at 6:18 PM, Michael Gummelt <
> mgumm...@mesosphere.io> wrote:
>
>> :)
>>
>> On Thu, Aug 25, 2016 at 2:29 PM, Marco Mistroni 
>> wrote:
>>
>>> No i wont accept that :)
>>> I can't believe i have wasted 3 hrs for a space!
>>>
>>> Many thanks MIchael!
>>>
>>> kr
>>>
>>> On Thu, Aug 25, 2016 at 10:01 PM, Michael Gummelt <
>>> mgumm...@mesosphere.io> wrote:
>>>
 You have a space between "build" and "mvn"

 On Thu, Aug 25, 2016 at 1:31 PM, Marco Mistroni <
 mmistr...@gmail.com> wrote:

> HI all
>  sorry for being partially off-topic; I hope there's someone on the
> list who has tried the same and encountered similar issues
>
> Ok so I have created a Dockerfile to build an Ubuntu container
> which includes Spark 2.0, but somehow when it gets to the point where 
> it
> has to kick off  ./build/mvn command, it errors out with the following
>
> ---> Running in 8c2aa6d59842
> /bin/sh: 1: ./build: Permission denied
> The command '/bin/sh -c ./build mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package' returned a non-zero 
> code:
> 126
>
> I am puzzled as i am root when i build the container, so i should
> not encounter this issue (btw, if instead of running mvn from the 
> build
> directory  i use the mvn which i installed on the container, it works 
> fine
> but it's  painfully slow)
>
> here are the details of my Spark command( scala 2.10, java 1.7 ,
> mvn 3.3.9 and git have already been installed)
>
> # Spark
> RUN echo "Installing Apache spark 2.0"
> RUN git clone git://github.com/apache/spark.git
> WORKDIR /spark
> RUN ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
> -DskipTests clean package
>
>
> Could anyone assist pls?
>
> kindest regards

Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-27 Thread Mich Talebzadeh
Thanks Nguyen for the link.

I installed Super Refresh as an add-on to Chrome. By default the refresh is off
until you set it to x seconds. However, the issue we have is that the Spark UI
comes with 6+ tabs and you have to repeat the process for each tab.

However, that messes things up. For example, if I choose to refresh the
"Executors" tab every 2 seconds and then decide to refresh the "Stages" tab,
there is a race condition whereby you are thrown back to the last refreshed
page, which is not really what one wants.

Ideally one wants the Spark UI page identified by host:port to be the driver
and every other tab underneath, say host:port/Stages, to be refreshed once we
open that tab and stay there. If I go back to, say, the SQL tab, I would like
to see it refreshed every n seconds.

I hope this makes sense.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 August 2016 at 15:01, nguyen duc Tuan  wrote:

> The simplest solution that I found: use a browser extension which does that
> for you :D. For example, if you are using Chrome, you can use this extension:
> https://chrome.google.com/webstore/detail/easy-auto-refresh/
> aabcgdmkeabbnleenpncegpcngjpnjkc/related?hl=en
> Another way, a bit more manual, is to use JavaScript: starting from a parent
> window, you create a child window with your target url; the parent window
> then refreshes that child window for you. Due to the same-origin policy, you
> should set the parent url to the same url as your target url. Try this in
> your web console:
> // open the target url in a child window and reload it every 2000 ms (2 seconds)
> var wi = window.open("your target url")
> var timeInMillis = 2000
> setInterval(function(){ wi.location.reload(); }, timeInMillis)
> Hope this helps.
>
> 2016-08-27 20:17 GMT+07:00 Mich Talebzadeh :
>
>> Hi All,
>>
>> GitHub project SparkUIDashboard created here
>> 
>>
>>
>>
>>
>> [image: Inline images 2]
>> Let us put the show on the road :)
>>
>> Cheers
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 August 2016 at 12:53, Sivakumaran S 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Unlikely that we can use Zeppelin for dynamic, real time update
>>> visualisation. It makes nice, static visuals.
>>>
>>> I was thinking more on the lines of http://dashingdemo.herokuap
>>> p.com/sample
>>>
>>> The library is http://dashing.io
>>>
>>> There are more widgets that can be used https://github.com/Shopif
>>> y/dashing/wiki/Additional-Widgets
>>>
>>> The Spark UI is functional, but I am looking forward to some aesthetics
>>> and high level picture of the process. Using Websockets, the dashboard can
>>> be updated real time without the need of refreshing the page.
>>>
>>> Regards,
>>>
>>> Sivakumaran S
>>>
>>>
>>> On 27-Aug-2016, at 10:10 AM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks Sivakumaran
>>>
>>> I don't think we can use Zeppelin for this purpose. It is not a real-time
>>> dashboard, nor can it be. I use it, but much like Tableau with added Scala
>>> programming.
>>>
>>> Does anyone know of open source real time dashboards?
>>>
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 August 2016 at 09:42, Sivakumaran S 
>>> wrote:
>>>
 I would love to participate in developing a dashboard of some sort in lieu
 of (or at least to complement) the Spark UI.

 Regards,

 Sivakumaran S

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Sebastian Piu
Hi Renato,

Check here on how to do it, it is in Java but you can translate it to Scala
if that is what you need.

Cheers

On Sat, 27 Aug 2016, 14:24 Renato Marroquín Mogrovejo, <
renatoj.marroq...@gmail.com> wrote:

> Hi Akhilesh,
>
> Thanks for your response.
> I am using Spark 1.6.1 and what I am trying to do is to ingest parquet
> files into the Spark Streaming, not in batch operations.
>
> val ssc = new StreamingContext(sc, Seconds(5))
> ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class",
> "parquet.avro.AvroReadSupport")
>
> val sqlContext = new SQLContext(sc)
>
> import sqlContext.implicits._
>
> val oDStream = ssc.fileStream[Void, Order,
> ParquetInputFormat]("TempData/origin/")
>
> oDStream.foreachRDD(relation => {
>   if (relation.count() == 0)
> println("Nothing received")
>   else {
> val rDF = relation.toDF().as[Order]
> println(rDF.first())
>   }
> })
>
> But that doesn't work. Any ideas?
>
>
> Best,
>
> Renato M.
>
> 2016-08-27 9:01 GMT+02:00 Akhilesh Pathodia :
>
>> Hi Renato,
>>
>> Which version of Spark are you using?
>>
>> If spark version is 1.3.0 or more then you can use SqlContext to read the
>> parquet file which will give you DataFrame. Please follow the below link:
>>
>>
>> https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically
>>
>> Thanks,
>> Akhilesh
>>
>> On Sat, Aug 27, 2016 at 3:26 AM, Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> Anybody? I think Rory also didn't get an answer from the list ...
>>>
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E
>>>
>>>
>>>
>>> 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
>>> renatoj.marroq...@gmail.com>:
>>>
 Hi all,

 I am trying to use parquet files as input for DStream operations, but I
 can't find any documentation or example. The only thing I found was [1] but
 I also get the same error as in the post (Class
 parquet.avro.AvroReadSupport not found).
 Ideally I would like to do have something like this:

 val oDStream = ssc.fileStream[Void, Order,
 ParquetInputFormat[Order]]("data/")

 where Order is a case class and the files inside "data" are all parquet
 files.
 Any hints would be highly appreciated. Thanks!


 Best,

 Renato M.

 [1]
 http://stackoverflow.com/questions/35413552/how-do-i-read-in-parquet-files-using-ssc-filestream-and-what-is-the-nature

>>>
>>>
>>
>


Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Sebastian Piu
Forgot to paste the link...
http://ramblings.azurewebsites.net/2016/01/26/save-parquet-rdds-in-apache-spark/
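
For the general idea (not necessarily the exact code from that post), a minimal
Scala sketch of writing each micro-batch out as parquet, assuming an Order case
class and a DStream[Order] called orders as in the earlier messages:

import org.apache.spark.sql.{SQLContext, SaveMode}

orders.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // reuse one SQLContext across micro-batches
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    // append this micro-batch to the parquet output directory (path is illustrative)
    rdd.toDF().write.mode(SaveMode.Append).parquet("TempData/output/")
  }
}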

On Sat, 27 Aug 2016, 19:18 Sebastian Piu,  wrote:

> Hi Renato,
>
> Check here on how to do it, it is in Java but you can translate it to
> Scala if that is what you need.
>
> Cheers
>
> On Sat, 27 Aug 2016, 14:24 Renato Marroquín Mogrovejo, <
> renatoj.marroq...@gmail.com> wrote:
>
>> Hi Akhilesh,
>>
>> Thanks for your response.
>> I am using Spark 1.6.1 and what I am trying to do is to ingest parquet
>> files into the Spark Streaming, not in batch operations.
>>
>> val ssc = new StreamingContext(sc, Seconds(5))
>>
>> ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class",
>> "parquet.avro.AvroReadSupport")
>>
>> val sqlContext = new SQLContext(sc)
>>
>> import sqlContext.implicits._
>>
>> val oDStream = ssc.fileStream[Void, Order,
>> ParquetInputFormat]("TempData/origin/")
>>
>> oDStream.foreachRDD(relation => {
>>   if (relation.count() == 0)
>> println("Nothing received")
>>   else {
>> val rDF = relation.toDF().as[Order]
>> println(rDF.first())
>>   }
>> })
>>
>> But that doesn't work. Any ideas?
>>
>>
>> Best,
>>
>> Renato M.
>>
>> 2016-08-27 9:01 GMT+02:00 Akhilesh Pathodia 
>> :
>>
>>> Hi Renato,
>>>
>>> Which version of Spark are you using?
>>>
>>> If spark version is 1.3.0 or more then you can use SqlContext to read
>>> the parquet file which will give you DataFrame. Please follow the below
>>> link:
>>>
>>>
>>> https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically
>>>
>>> Thanks,
>>> Akhilesh
>>>
>>> On Sat, Aug 27, 2016 at 3:26 AM, Renato Marroquín Mogrovejo <
>>> renatoj.marroq...@gmail.com> wrote:
>>>
 Anybody? I think Rory also didn't get an answer from the list ...


 https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E



 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
 renatoj.marroq...@gmail.com>:

> Hi all,
>
> I am trying to use parquet files as input for DStream operations, but
> I can't find any documentation or example. The only thing I found was [1]
> but I also get the same error as in the post (Class
> parquet.avro.AvroReadSupport not found).
> Ideally I would like to do have something like this:
>
> val oDStream = ssc.fileStream[Void, Order,
> ParquetInputFormat[Order]]("data/")
>
> where Order is a case class and the files inside "data" are all
> parquet files.
> Any hints would be highly appreciated. Thanks!
>
>
> Best,
>
> Renato M.
>
> [1]
> http://stackoverflow.com/questions/35413552/how-do-i-read-in-parquet-files-using-ssc-filestream-and-what-is-the-nature
>


>>>
>>


Write parquet file from Spark Streaming

2016-08-27 Thread Kevin Tran
Hi Everyone,

Does anyone know how to write a parquet file after parsing data in Spark
Streaming?



Thanks,

Kevin.


Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-27 Thread kant kodali
I understand now that I cannot use the Spark Streaming window operation without
checkpointing to HDFS, as pointed out by @Ofir, but without window operations I
don't think we can do much with Spark Streaming. So, since it is essential, can
I use Cassandra as the distributed storage? If so, can I see an example of how I
can tell the Spark cluster to use Cassandra for checkpointing and the rest, if
that is possible at all?





On Fri, Aug 26, 2016 9:50 AM, Steve Loughran ste...@hortonworks.com wrote:

On 26 Aug 2016, at 12:58, kant kodali < kanth...@gmail.com > wrote:
@Steve your arguments make sense, however a good majority of people who have
extensive experience with ZooKeeper prefer to avoid it, and given the ease of
Consul (which btw uses Raft for the election) and etcd, a lot of us are more
inclined to avoid ZK.
And yes, any technology needs time to mature, but that said it shouldn't stop
us from transitioning; for example, people started using Spark when it was
first released instead of waiting for Spark 2.0, where there are a lot of
optimizations and bug fixes.


One way to look at the problem is "what is the cost if something doesn't work?"
If it's some HA consensus system, one failure mode is "consensus failure,
everything goes into minority mode and offline": service lost, data fine.
Another is "partition with both groups thinking they are in charge", which is
more dangerous. Then there's "partitioning event not detected", which may be
bad.
So: consider the failure modes and then consider not so much whether the tech
you are using is vulnerable to them, but "if it goes wrong, does it matter?"

Even before HDFS had HA with ZK/BookKeeper it didn't fail very often. And if you
looked at the causes of those failures, things like backbone switch failure are
so traumatic that things like ZK/etcd failures aren't going to make much of a
difference. The filesystem is down.
Generally, integrity gets priority over availability. That said, S3 and the like
have put availability ahead of consistency, and Cassandra can offer that too;
sometimes it is the right strategy.