Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

2018-04-12 Thread Szuromi Tamás
Hi Igor,

Have you started the external shuffle service manually?
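
On Mesos it needs to be running on every agent before the driver registers
(it listens on port 7337 by default, which matches the port in your trace).
As a rough sketch, assuming dynamic allocation is why you need the service,
the driver-side settings look like this (property names as in the Spark
docs; the service itself is launched separately on each agent, e.g. with
$SPARK_HOME/sbin/start-mesos-shuffle-service.sh):

from pyspark.sql import SparkSession

# Driver-side configuration only; a sketch, not a full setup. The external
# shuffle service must already be running on every Mesos agent.
spark = (SparkSession.builder
         .appName("shuffle-service-example")  # hypothetical app name
         .config("spark.shuffle.service.enabled", "true")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.port", "7337")  # default port
         .getOrCreate())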

Cheers

2018-04-12 10:48 GMT+02:00 igor.berman :

> Hi,
> any input on whether this is expected:
> The driver starts but is unable to connect to the external shuffle service
> on one of the nodes (no matter what the reason is).
> This makes the framework go to Inactive mode in the Mesos UI.
> However, it seems that the driver doesn't exit and continues to execute
> tasks (or tries to). The stack trace below shows a few lines around the
> connection error and the aborting message.
>
> The question is: is this expected behaviour?
>
> Here is the stack trace:
>
> I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with
> 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> Exception in thread "Thread-295" java.io.IOException: Failed to connect to
> my-company.com/x.x.x.x:7337
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
> at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
> at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException:
> Connection refused: my-company.com/x.x.x.x:7337
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
> at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)
> I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
> I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework
> 15d9838f-b266-413b-842d-f7c3567bd04a-0051
>


Re: Windows10 + pyspark + ipython + csv file loading with timestamps

2017-12-18 Thread Szuromi Tamás
Hi Esa,

I'm using it like this:
https://gist.github.com/tromika/1cda392242fdd66befe7970d80380216
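
In short, the notebook preamble ends up something like the sketch below
(my own minimal version for Spark 2.2, not the gist verbatim; the file
name and timestamp format are placeholders you'd adapt to your data):

from pyspark.sql import SparkSession

# Create (or reuse) the session; it also gives you the SparkContext.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("csv-timestamps")
         .getOrCreate())
sc = spark.sparkContext

# "events.csv" is a placeholder; timestampFormat must match the data.
df = spark.read.csv("events.csv", header=True, inferSchema=True,
                    timestampFormat="yyyy-MM-dd HH:mm:ss")
df.printSchema()  # timestamp columns should come back as type timestamp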

cheers,

2017-12-16 11:04 GMT+01:00 Esa Heikkinen :

> Hi
>
> Does anyone have hints or example code for getting this combination to
> work: Windows 10 + pyspark + IPython notebook + loading a CSV file with
> timestamps (time series data) into a DataFrame or RDD?
>
> I have already installed Windows 10 + pyspark + IPython notebook and they
> seem to work, but my Python code in the notebook does not, possibly
> because the Spark context is not set up correctly.
>
> What commands should be put at the beginning of the notebook? sc =
> SparkContext.getOrCreate() ? spark = SparkSession(sc) ?
>
> I have installed spark-2.2.1-bin-hadoop2.7 and IPython 6.1.0 on
> Windows 10.
>
> 
>
> Esa
>


Re: Metadata Management

2017-10-20 Thread Szuromi Tamás
Hi Vasu,

https://github.com/linkedin/WhereHows might be a good fit.

Cheers
Tamas

On Thu, Oct 19, 2017 at 23:22, Vasu Gourabathina wrote:

> All:
>
> This may be off topic for Spark, but I'm sure several of you have used
> some form of this as part of your Big Data implementations, so I wanted
> to reach out.
>
> As part of the data lake and data processing (by Spark, as an example), we
> might end up with different form factors for the files (via cleanup,
> enrichment, etc.).
>
> In order to make this data available for exploration by analysts and
> data scientists, how should the metadata be managed?
>   - Creating a metadata repository
>   - Making the schemas available to users, so they can use them to create
> Hive tables, query them with Presto, etc.
>
> Can you recommend some patterns (or tools) to help manage the metadata?
> I'm trying to reduce the dependency on the engineers and make the
> analysts/scientists as self-sufficient as possible.
>
> Azure and AWS Glue Data Catalog seem to address this. Any input on these
> two?
>
> Thanks in advance.
>
> Thanks,
> Vasu.
>


Re: how to convert the binary from Kafka to string please

2017-07-24 Thread Szuromi Tamás
Hi,

You can cast it to a string in a select, or you can set the
value.deserializer parameter for Kafka.
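
For the select route, it would look roughly like this (shown in PySpark,
assuming an existing SparkSession named spark; the same CAST expression
works in Scala's selectExpr, and the broker and topic are placeholders):

# Kafka sources expose key and value as binary columns.
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")  # placeholder
      .option("subscribe", "topic")                    # placeholder
      .load())

# Cast the value to a string in a select before writing it out.
strings = df.selectExpr("CAST(value AS STRING)")

(strings.writeStream
    .outputMode("append")
    .format("console")
    .start()
    .awaitTermination())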

cheers,

2017-07-24 4:44 GMT+02:00 萝卜丝炒饭 <1427357...@qq.com>:

> Hi all
>
> I want to convert the binary from Kafka to a string. Could you help me,
> please?
>
> val df = ss.readStream.format("kafka")
>   .option("kafka.bootstrap.servers", "")  // "servers", not "server"
>   .option("subscribe", "")
>   .load()
>
> val value = df.select("value")
>
> value.writeStream
>   .outputMode("append")
>   .format("console")
>   .start()
>   .awaitTermination()
>
>
> The above code outputs a result like:
>
> +-------+
> |  value|
> +-------+
> |[61,61]|
> +-------+
>
>
> 0x61 is the character 'a' received from Kafka.
> I want to print [a,a] or aa.
> How should I do that?
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Szuromi Tamás
For us, after some Spark Streaming transformations, Elasticsearch + Kibana
is a great combination for storing and visualizing data.
An alternative solution we use is to have Spark Streaming write some data
back to Kafka and consume it with Node.js.
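
To make the first route concrete, here is a minimal PySpark sketch (the
socket source, index name, and localhost Elasticsearch node are all
assumptions; at real volume you would use the bulk helpers or the
es-hadoop connector instead):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from elasticsearch import Elasticsearch  # pip install elasticsearch

sc = SparkContext(appName="dashboard-feed")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Hypothetical source: one event per line on a local socket.
events = ssc.socketTextStream("localhost", 9999) \
            .map(lambda line: {"message": line})

def index_partition(docs):
    # One client per partition; Kibana then reads the index directly.
    es = Elasticsearch(["http://localhost:9200"])
    for doc in docs:
        es.index(index="dashboard-events", body=doc)  # body= on es-py < 8

events.foreachRDD(lambda rdd: rdd.foreachPartition(index_partition))

ssc.start()
ssc.awaitTermination()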

Cheers,
Tamas

2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman :

> Read this first:
>
> http://www.oreilly.com/data/free/big-data-analytics-emerging-architecture.csp
>
> https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf
>
> http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf
>
> http://www.gartner.com/smarterwithgartner/six-best-practices-for-real-time-analytics/
>
> https://speakerdeck.com/elasticsearch/using-elasticsearch-logstash-and-kibana-to-create-realtime-dashboards
>
> https://www.youtube.com/watch?v=PuvHINcU9DI
>
> then take a look at
>
> https://kudu.apache.org/
>
> Tell us later what you think.
>
> Alonso Isidoro Roman
> about.me/alonso.isidoro.roman
>
> 
>
> 2017-03-30 7:14 GMT+02:00 Gaurav Pandya :
>
>> Hi Noorul,
>>
>> Thanks for the reply.
>> But then how do we build the dashboard report? Don't we need to store the
>> data somewhere?
>> Please suggest.
>>
>> Thanks.
>> Gaurav
>>
>> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
>> noo...@noorul.com> wrote:
>>
>>> I think a better place would be an in-memory cache for real time.
>>>
>>> Regards,
>>> Noorul
>>>
>>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
>>> wrote:
>>> > I am getting streaming data and want to show it on dashboards in real
>>> > time.
>>> > How best can we handle this streaming data? Where should it be stored:
>>> > a DB, HDFS, or something else?
>>> > I want to give users a real-time analytics experience.
>>> >
>>> > Please suggest possible ways. Thanks.
>>> >
>>>
>>
>>
>