Re: Spark Streaming to REST API

2017-12-21 Thread ashish rawat
Sorry for not making it explicit. We are using Spark Streaming as the
streaming solution, and I was wondering whether it is a common pattern to do
per-tuple Redis reads/writes and write to a REST API from Spark Streaming.
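
For illustration, a minimal Scala sketch of that pattern, assuming the Jedis
client for Redis and a plain HTTP POST for the REST call; the Redis host, the
endpoint URL, and the joined-record format are placeholders, and a production
job would also pool connections, batch writes, and handle retries:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

object StreamJoinSketch {

  // events: (id, payload) tuples arriving from the out-of-order streams
  def joinAndPublish(events: DStream[(String, String)]): Unit = {
    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One connection per partition rather than per tuple keeps overhead bounded
        val jedis = new Jedis("redis-host", 6379) // placeholder host/port
        partition.foreach { case (id, payload) =>
          val other = jedis.get(id)
          if (other != null) {
            // The matching half arrived earlier: join and push to the REST API
            jedis.del(id)
            post("https://target-system/api/records", s"$other|$payload") // placeholder endpoint/format
          } else {
            // First half of the pair: park it in Redis until its partner arrives
            jedis.set(id, payload)
          }
        }
        jedis.close()
      }
    }
  }

  private def post(endpoint: String, body: String): Unit = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val os = conn.getOutputStream
    os.write(body.getBytes(StandardCharsets.UTF_8))
    os.close()
    conn.getResponseCode // forces the request; real code should check the status and retry
    conn.disconnect()
  }
}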

Regards,
Ashish

On Fri, Dec 22, 2017 at 4:00 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> hi Ashish,
>
> I was just wondering if there is any particular reason why you are posting
> this to a SPARK group?
>
> Regards,
> Gourav
>
> On Thu, Dec 21, 2017 at 8:32 PM, ashish rawat <dceash...@gmail.com> wrote:
>
>> Hi,
>>
> >> We are working on a streaming solution where multiple out-of-order
> >> streams flow into the system and we need to join the streams on a unique
> >> id. We are planning to use Redis for this: for every tuple, we will look
> >> up whether the id exists; if it does, we join, otherwise we put the tuple
> >> into Redis. Also, we need to write the final output to a system through a
> >> REST API (the system doesn't provide any other mechanism to write).
> >>
> >> Is it a common pattern to read from/write to a DB per tuple? Also, are
> >> there any connectors for writing to REST endpoints?
>>
>> Regards,
>> Ashish
>>
>
>


Spark Streaming to REST API

2017-12-21 Thread ashish rawat
Hi,

We are working on a streaming solution where multiple out-of-order streams
flow into the system and we need to join the streams on a unique id. We are
planning to use Redis for this: for every tuple, we will look up whether the
id exists; if it does, we join, otherwise we put the tuple into Redis. Also,
we need to write the final output to a system through a REST API (the system
doesn't provide any other mechanism to write).

Is it a common pattern to read from/write to a DB per tuple? Also, are there
any connectors for writing to REST endpoints?

Regards,
Ashish


Re: NLTK with Spark Streaming

2017-12-01 Thread ashish rawat
Thanks Nicholas, but the problem for us is that we want to use the NLTK
Python library, since our data scientists train their models with it.
Rewriting the inference logic with some other library would be time
consuming, and in some cases it may not even work because some functions are
not available.
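
For reference, one pattern that avoids rewriting the models is to pipe each
record through a Python process with RDD.pipe. A minimal Scala sketch follows;
it assumes NLTK and a scoring script (score.py here, a hypothetical name and
path) are installed on every executor, and that records are exchanged one per
line over stdin/stdout:

import org.apache.spark.streaming.dstream.DStream

object NltkPipeSketch {
  // lines: the text records flowing through the Scala streaming job
  def scoreWithNltk(lines: DStream[String]): DStream[String] = {
    // Each partition's records are written to the script's stdin, one per line;
    // whatever the script prints to stdout comes back as the output records.
    lines.transform(rdd => rdd.pipe("python /opt/nlp/score.py"))
  }
}

The same approach works on plain RDDs outside streaming; the line-based
serialization format between the JVM and the script is the main thing to agree
on with the data scientists.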

On Nov 29, 2017 3:16 AM, "Nicholas Hakobian" <
nicholas.hakob...@rallyhealth.com> wrote:

Depending on your needs, it's fairly easy to write a lightweight Python
wrapper around the Databricks spark-corenlp library:
https://github.com/databricks/spark-corenlp


Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Sun, Nov 26, 2017 at 8:19 AM, ashish rawat <dceash...@gmail.com> wrote:

> Thanks Holden and Chetan.
>
> Holden - have you tried it out? Do you know the right way to do it?
> Chetan - yes, if we use a Java NLP library, there should be no issue
> integrating it with Spark Streaming, but as I pointed out earlier, we want
> to give data scientists the flexibility to use the language and library of
> their choice, instead of restricting them to a library of our choice.
>
> On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> But you can still use the Stanford NLP library and distribute it through
>> Spark, right?
>>
>> On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> So it’s certainly doable (it’s not super easy mind you), but until the
>>> arrow udf release goes out it will be rather slow.
>>>
>>> On Sun, Nov 26, 2017 at 8:01 AM ashish rawat <dceash...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has someone tried running NLTK (Python) with Spark Streaming (Scala)? I
>>>> was wondering if this is a good idea and what the right Spark operators
>>>> are to do this. The reason we want to try this combination is that we
>>>> don't want to run our transformations in Python (PySpark), but after the
>>>> transformations we need to run some natural language processing
>>>> operations, and we don't want to restrict the functions data scientists
>>>> can use to a Spark natural language library. So, Spark Streaming with
>>>> NLTK looks like the right option, from the perspective of fast data
>>>> processing and data science flexibility.
>>>>
>>>> Regards,
>>>> Ashish
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>


Re: NLTK with Spark Streaming

2017-11-26 Thread ashish rawat
Thanks Holden and Chetan.

Holden - have you tried it out? Do you know the right way to do it?
Chetan - yes, if we use a Java NLP library, there should be no issue
integrating it with Spark Streaming, but as I pointed out earlier, we want
to give data scientists the flexibility to use the language and library of
their choice, instead of restricting them to a library of our choice.

On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> But you can still use the Stanford NLP library and distribute it through
> Spark, right?
>
> On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> So it’s certainly doable (it’s not super easy mind you), but until the
>> arrow udf release goes out it will be rather slow.
>>
>> On Sun, Nov 26, 2017 at 8:01 AM ashish rawat <dceash...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Has someone tried running NLTK (Python) with Spark Streaming (Scala)? I
>>> was wondering if this is a good idea and what the right Spark operators
>>> are to do this. The reason we want to try this combination is that we
>>> don't want to run our transformations in Python (PySpark), but after the
>>> transformations we need to run some natural language processing
>>> operations, and we don't want to restrict the functions data scientists
>>> can use to a Spark natural language library. So, Spark Streaming with
>>> NLTK looks like the right option, from the perspective of fast data
>>> processing and data science flexibility.
>>>
>>> Regards,
>>> Ashish
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


NLTK with Spark Streaming

2017-11-25 Thread ashish rawat
Hi,

Has someone tried running NLTK (Python) with Spark Streaming (Scala)? I was
wondering if this is a good idea and what the right Spark operators are to
do this. The reason we want to try this combination is that we don't want
to run our transformations in Python (PySpark), but after the
transformations we need to run some natural language processing operations,
and we don't want to restrict the functions data scientists can use to a
Spark natural language library. So, Spark Streaming with NLTK looks like
the right option, from the perspective of fast data processing and data
science flexibility.

Regards,
Ashish


Re: Spark based Data Warehouse

2017-11-17 Thread ashish rawat
Thanks, everyone, for your suggestions. Do any of you take care of auto
scaling (up and down) of your underlying Spark clusters on AWS?

On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" <lucas.g...@gmail.com>
wrote:

Hi Ashish, bear in mind that EMR has some additional tooling available that
smooths out some S3 problems that you may / almost certainly will
encounter.

We are using Spark with S3 (not on EMR) and have encountered issues with
file consistency. You can deal with it, but be aware it's additional
technical debt that you'll need to own. We didn't want to own an HDFS
cluster, so we consider it worthwhile.

Here are some additional resources:  The video is Steve Loughran talking
about S3.
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
https://www.youtube.com/watch?v=ND4L_zSDqF0

For the record we use S3 heavily but tend to drop our processed data into
databases so they can be more easily consumed by visualization tools.

Good luck!

Gary Lucas

On 13 November 2017 at 20:04, Affan Syed <as...@an10.io> wrote:

> Another option that we are trying internally is to use Mesos for
> isolating different jobs or groups. Within a single group, using Livy to
> create different Spark contexts also works.
>
> - Affan
>
> On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat <dceash...@gmail.com> wrote:
>
>> Thanks Sky Yin. This really helps.
>>
>> On Nov 14, 2017 12:11 AM, "Sky Yin" <sky@gmail.com> wrote:
>>
>> We are running Spark in AWS EMR as a data warehouse. All data is in S3 and
>> metadata is in the Hive metastore.
>>
>> We have internal tools to create Jupyter notebooks on the dev cluster. I
>> guess you can use Zeppelin instead, or Livy?
>>
>> We run Genie as a job server for the prod cluster, so users have to
>> submit their queries through Genie. For better resource utilization, we
>> rely on YARN dynamic allocation to balance the load of multiple
>> jobs/queries in Spark.
>>
>> Hope this helps.
>>
>> On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com>
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> I was trying to understand if anyone here has tried a data warehouse
>>> solution using S3 and Spark SQL. Out of multiple possible options
>>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
>>> our aggregates and processing requirements.
>>>
>>> If anyone has tried it out, I would like to understand the following:
>>>
>>>1. Are Spark SQL and UDFs able to handle all the workloads?
>>>2. What user interface did you provide for data scientists, data
>>>engineers, and analysts?
>>>3. What are the challenges in running concurrent queries, by many
>>>users, over Spark SQL? Considering Spark still does not provide spill to
>>>disk in many scenarios, are there frequent query failures when executing
>>>concurrent queries?
>>>4. Are there any open source implementations which provide something
>>>similar?
>>>
>>>
>>> Regards,
>>> Ashish
>>>
>>
>>
>


Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks Sky Yin. This really helps.

On Nov 14, 2017 12:11 AM, "Sky Yin" <sky@gmail.com> wrote:

We are running Spark in AWS EMR as a data warehouse. All data is in S3 and
metadata is in the Hive metastore.

We have internal tools to create Jupyter notebooks on the dev cluster. I
guess you can use Zeppelin instead, or Livy?

We run Genie as a job server for the prod cluster, so users have to submit
their queries through Genie. For better resource utilization, we rely on
YARN dynamic allocation to balance the load of multiple jobs/queries in
Spark.
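
For concreteness, a minimal sketch of the dynamic allocation settings being
referred to; the values are illustrative, and the external shuffle service has
to be enabled on the YARN node managers for this to work:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shared-sql-jobs")                              // placeholder app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")               // required so executors can be released safely
  .set("spark.dynamicAllocation.minExecutors", "2")           // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s")

With these set, idle executors are handed back to YARN and busy queries can
grow up to the configured maximum, which is what lets multiple jobs share the
cluster.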

Hope this helps.

On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <dceash...@gmail.com> wrote:

> Hello Everyone,
>
> I was trying to understand if anyone here has tried a data warehouse
> solution using S3 and Spark SQL. Out of multiple possible options
> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
> our aggregates and processing requirements.
>
> If anyone has tried it out, I would like to understand the following:
>
>1. Are Spark SQL and UDFs able to handle all the workloads?
>2. What user interface did you provide for data scientists, data
>engineers, and analysts?
>3. What are the challenges in running concurrent queries, by many
>users, over Spark SQL? Considering Spark still does not provide spill to
>disk in many scenarios, are there frequent query failures when executing
>concurrent queries?
>4. Are there any open source implementations which provide something
>similar?
>
>
> Regards,
> Ashish
>


Re: Spark based Data Warehouse

2017-11-13 Thread ashish rawat
Thanks, everyone. I am still not clear on the right way to support multiple
users running concurrent queries with Spark. Is it through multiple Spark
contexts, or through Livy (which creates a single Spark context only)?

Also, what kind of isolation is possible with Spark SQL? If one user fires
a big query, would that choke all other queries in the cluster?
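
For context, within a single Spark application (the single-context Livy case),
one commonly used lever for this is Spark's FAIR scheduler with pools. A
minimal sketch, assuming a fairscheduler.xml on the driver; the file path, pool
name, and the table/query below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("shared-sql-endpoint")                                      // placeholder app name
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml") // hypothetical path

val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

// Each user's queries run in their own pool, so one heavy query gets a fair
// share of tasks instead of monopolising the whole application.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "adhoc_users")  // hypothetical pool
spark.sql("SELECT region, count(*) FROM events GROUP BY region").show()     // hypothetical table/query

Across separate applications (multiple contexts), isolation instead comes from
the resource manager, e.g. YARN queues, as others on the thread point out.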

Regards,
Ashish

On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <palw...@hortonworks.com>
wrote:

> Alcon,
>
>
>
> You can most certainly do this. I’ve done benchmarking with Spark SQL and
> the TPCDS queries using S3 as the filesystem.
>
>
>
> Zeppelin and Livy Server work well for the dashboarding and concurrent
> query issues:
> https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
>
>
>
> Livy Server will allow you to create multiple spark contexts via REST:
> https://livy.incubator.apache.org/
>
>
>
> If you are looking for broad SQL functionality I’d recommend instantiating
> a Hive context. And Spark is able to spill to disk:
> https://spark.apache.org/faq.html
>
>
>
> There are multiple companies running Spark within their data warehouse
> solutions:
> https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/
>
>
>
> Edmunds used Spark to allow business analysts to point Spark to files in
> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>
>
>
> Recommend running some benchmarks and testing query scenarios for your end
> users; but it sounds like you’ll be using it for exploratory analysis.
> Spark is great for this ☺
>
>
>
> -Pat
>
>
>
>
>
> *From: *Vadim Semenov <vadim.seme...@datadoghq.com>
> *Date: *Sunday, November 12, 2017 at 1:06 PM
> *To: *Gourav Sengupta <gourav.sengu...@gmail.com>
> *Cc: *Phillip Henry <londonjava...@gmail.com>, ashish rawat <
> dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma <
> deepakmc...@gmail.com>, spark users <user@spark.apache.org>
> *Subject: *Re: Spark based Data Warehouse
>
>
>
> It's actually quite simple to answer
>
>
>
> > 1. Are Spark SQL and UDFs able to handle all the workloads?
>
> Yes
>
>
>
> > 2. What user interface did you provide for data scientists, data
> engineers, and analysts?
>
> Home-grown platform, EMR, Zeppelin
>
>
>
> > What are the challenges in running concurrent queries, by many users,
> over Spark SQL? Considering Spark still does not provide spill to disk in
> many scenarios, are there frequent query failures when executing concurrent
> queries?
>
> You can run separate Spark Contexts, so jobs will be isolated
>
>
>
> > Are there any open source implementations which provide something
> similar?
>
> Yes, many.
>
>
>
>
>
> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> Dear Ashish,
>
> what you are asking for involves at least a few weeks of dedicated
> understanding of your use case, and then it takes at least 3 to 4 months to
> even propose a solution. You can even build a fantastic data warehouse just
> using C++. The matter depends on lots of conditions. I just think that your
> approach and question need a lot of modification.
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com>
> wrote:
>
> Hi, Ashish.
>
> You are correct in saying that not *all* functionality of Spark is
> spill-to-disk, but I am not sure how this pertains to a "concurrent user
> scenario". Each executor runs in its own JVM and is therefore isolated
> from the others. That is, if the JVM of one user dies, this should not
> affect another user who is running their own jobs in their own JVMs. The
> amount of resources used by a user can be controlled by the resource manager.
>
> AFAIK, you configure something like YARN to limit the number of cores and
> the amount of memory in the cluster a certain user or group is allowed to
> use for their job. This is obviously quite a coarse-grained approach as (to
> my knowledge) IO is not throttled. I believe people generally use something
> like Apache Ambari to keep an eye on network and disk usage to mitigate
> problems in a shared cluster.
>
> If the user has badly designed their query, it may very well fail with
> OOMEs but this can happen irrespective of whether one user or many is using
> the cluster at a given moment in time.
>
>
>
> Does this help?
>
> Regards,
>
> Phillip
>
>
>
> On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <dceash...@gmail.

Re: Spark based Data Warehouse

2017-11-12 Thread ashish rawat
Thanks, Jörn and Phillip. My question was specifically to anyone who has
tried creating a system using Spark SQL as a data warehouse. I was trying to
check whether someone has tried it and can help with the kinds of workloads
that worked and the ones that had problems.

Regarding spill to disk, I might be wrong, but not all functionality of
Spark is spill-to-disk, so it still doesn't provide DB-like reliability in
execution. In the case of DBs, queries get slow, but they don't fail or go
out of memory, specifically in concurrent-user scenarios.

Regards,
Ashish

On Nov 12, 2017 3:02 PM, "Phillip Henry" <londonjava...@gmail.com> wrote:

Agree with Jörn. The answer is: it depends.

In the past, I've worked with data scientists who are happy to use the
Spark CLI. Again, the answer is "it depends" (in this case, on the skills
of your customers).

Regarding sharing resources, different teams were limited to their own
queue so they could not hog all the resources. However, people within a
team had to do some horse trading if they had a particularly intensive job
to run. I did feel that this was an area that could be improved. It may have
been by now; I've just not looked into it for a while.

BTW I'm not sure what you mean by "Spark still does not provide spill to
disk" as the FAQ says "Spark's operators spill data to disk if it does not
fit in memory" (http://spark.apache.org/faq.html). So, your data will not
normally cause OutOfMemoryErrors (certain terms and conditions may apply).

My 2 cents.

Phillip



On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> What do you mean by all possible workloads?
> You cannot prepare any system to do all possible processing.
>
> We do not know the requirements of your data scientists now or in the
> future, so it is difficult to say. How do they work currently without the
> new solution? Do they all work on the same data? I bet you will receive a
> lot of private messages on your email trying to sell solutions that solve
> everything; with the information you provided, this is impossible to say.
>
> Then, as with every system: have incremental releases, but have them in
> short time frames; do not engineer a big system that you will deliver in 2
> years. In the cloud you have the perfect possibility to scale not only
> feature-wise but also infrastructure-wise.
>
> The challenge with concurrent queries is the right definition of the
> scheduler (e.g. the fair scheduler), so that no single query takes all the
> resources and long-running queries do not starve.
>
> User interfaces: what could help are notebooks (Jupyter etc) but you may
> need to train your data scientists. Some may know or prefer other tools.
>
> On 12. Nov 2017, at 08:32, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
> I am looking for a similar solution, more aligned to the data scientist
> group. The concern I have is about supporting complex aggregations at
> runtime.
>
> Thanks
> Deepak
>
> On Nov 12, 2017 12:51, "ashish rawat" <dceash...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I was trying to understand if anyone here has tried a data warehouse
>> solution using S3 and Spark SQL. Out of multiple possible options
>> (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
>> our aggregates and processing requirements.
>>
>> If anyone has tried it out, I would like to understand the following:
>>
>>1. Are Spark SQL and UDFs able to handle all the workloads?
>>2. What user interface did you provide for data scientists, data
>>engineers, and analysts?
>>3. What are the challenges in running concurrent queries, by many
>>users, over Spark SQL? Considering Spark still does not provide spill to
>>disk in many scenarios, are there frequent query failures when executing
>>concurrent queries?
>>4. Are there any open source implementations which provide something
>>similar?
>>
>>
>> Regards,
>> Ashish
>>
>


Spark based Data Warehouse

2017-11-11 Thread ashish rawat
Hello Everyone,

I was trying to understand if anyone here has tried a data warehouse
solution using S3 and Spark SQL. Out of multiple possible options
(Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for
our aggregates and processing requirements.

If anyone has tried it out, I would like to understand the following:

   1. Are Spark SQL and UDFs able to handle all the workloads?
   2. What user interface did you provide for data scientists, data
   engineers, and analysts?
   3. What are the challenges in running concurrent queries, by many users,
   over Spark SQL? Considering Spark still does not provide spill to disk in
   many scenarios, are there frequent query failures when executing
   concurrent queries?
   4. Are there any open source implementations which provide something
   similar?
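
To make the intent concrete, a minimal sketch of the kind of Spark SQL
aggregation over S3 data being considered; the bucket, paths, table, and
column names are placeholders, and it assumes the s3a connector and
(optionally) a Hive metastore are already configured:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-warehouse-poc")   // placeholder app name
  .enableHiveSupport()           // optional: reuse table definitions from the Hive metastore
  .getOrCreate()

// Placeholder bucket and schema: raw events stored as Parquet in S3
val events = spark.read.parquet("s3a://my-warehouse/events/")
events.createOrReplaceTempView("events")

// An example aggregate written back to S3 for downstream consumers
spark.sql(
  """SELECT event_date, event_type, count(*) AS cnt
    |FROM events
    |GROUP BY event_date, event_type""".stripMargin)
  .write
  .mode("overwrite")
  .parquet("s3a://my-warehouse/aggregates/daily_event_counts/")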


Regards,
Ashish


Re: Spark for Log Analytics

2016-03-31 Thread ashish rawat
Thanks for your replies, Steve and Chris.

Steve,

I am creating a real-time pipeline, so I am not looking to dump data to
HDFS right now. Also, since the log sources would be Nginx, Mongo and
application events, it might not be possible to always route events
directly from the source to Flume. Therefore, I thought that the "tail -f"
strategy used by Fluentd, Logstash and others might be the only unifying
solution for collecting logs.

Chris,

Can you please elaborate on the source-to-Kafka part? Do all event sources
have integration with Kafka? E.g., if you need to send the server logs
(Apache/Nginx/Mongo, etc.) to Kafka, what would be the ideal strategy?

Regards,
Ashish

On Thu, Mar 31, 2016 at 5:16 PM, Chris Fregly <ch...@fregly.com> wrote:

> oh, and I forgot to mention Kafka Streams which has been heavily talked
> about the last few days at Strata here in San Jose.
>
> Streams can simplify a lot of this architecture by performing some
> light-to-medium-complex transformations in Kafka itself.
>
> I'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams,
> so I can try this out myself - and hopefully remove a lot of extra plumbing.
>
> On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly <ch...@fregly.com> wrote:
>
>> this is a very common pattern, yes.
>>
>> note that in Netflix's case, they're currently pushing all of their logs
>> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
>> ElasticSearch, and/or another Kafka Topic for further consumption by
>> internal apps using other technologies like Spark Streaming (instead of
>> Samza).
>>
>> this Fronting Kafka + Samza Router also helps to differentiate between
>> high-priority events (Errors or High Latencies) and normal-priority events
>> (normal User Play or Stop events).
>>
>> Here's a recent presentation I did which details this configuration,
>> starting at slide 104:
>> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
>>
>> BTW, Confluent's distribution of Kafka does have a direct HTTP/REST API
>> which is not recommended for production use, but it has worked well for me
>> in the past.
>>
>> these are some additional options to think about, anyway.
>>
>>
>> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>>
>>> On 31 Mar 2016, at 09:37, ashish rawat <dceash...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have been evaluating Spark for analysing application and server logs.
>>> I believe there are some downsides of doing this:
>>>
>>> 1. No direct mechanism for collecting logs, so we need to introduce other
>>> tools like Flume into the pipeline.
>>>
>>>
>>> you need something to collect logs no matter what you run. Flume isn't
>>> so bad; if you bring it up on the same host as the app then you can even
>>> collect logs while the network is playing up.
>>>
>>> Or you can just copy log4j files to HDFS and process them later
>>>
>>> 2. We need to write lots of code for parsing different patterns from
>>> logs, while some log analysis tools like Logstash or Loggly provide this
>>> out of the box.
>>>
>>>
>>>
>>> Log parsing is essentially an ETL problem, especially if you don't try
>>> to lock down the log event format.
>>>
>>> You can also configure Log4J to save stuff in an easy-to-parse format
>>> and/or forward directly to your application.
>>>
>>> There's a Log4j-to-Flume connector to do that for you:
>>>
>>>
>>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>>
>>> or you can output in, say, JSON (
>>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>>>  )
>>>
>>> I'd go with Flume unless you have a need to save the logs locally and
>>> copy them to HDFS later.
>>>
>>>
>>>
>>> On the benefits side, I believe Spark might be more performant (although
>>> I am yet to benchmark it) and being a generic processing engine, might work
>>> with complex use cases where the out of the box functionality of log
>>> analysis tools is not sufficient (although I don't have any such use case
>>> right now).
>>>
>>> One option I was considering was to use Logstash for collection and
>>> basic processing and then sink the processed logs to both Elasticsearch
>>> and Kafka, so that Spark Streaming can pick up data from Kafka for the
>>> complex use cases, while Logstash filters can be used for the simpler ones.
>>>
>>> I was wondering if someone has already done this evaluation and could
>>> provide me some pointers on how/if to create this pipeline with Spark.
>>>
>>> Regards,
>>> Ashish
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Spark for Log Analytics

2016-03-31 Thread ashish rawat
Hi,

I have been evaluating Spark for analysing application and server logs. I
believe there are some downsides of doing this:

1. No direct mechanism for collecting logs, so we need to introduce other
tools like Flume into the pipeline.
2. We need to write lots of code for parsing different patterns from logs,
while some log analysis tools like Logstash or Loggly provide this out of
the box.

On the benefits side, I believe Spark might be more performant (although I
am yet to benchmark it) and, being a generic processing engine, might work
with complex use cases where the out-of-the-box functionality of log
analysis tools is not sufficient (although I don't have any such use case
right now).

One option I was considering was to use Logstash for collection and basic
processing and then sink the processed logs to both Elasticsearch and Kafka,
so that Spark Streaming can pick up data from Kafka for the complex use
cases, while Logstash filters can be used for the simpler ones.
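
For illustration, a minimal sketch of the Kafka-to-Spark-Streaming leg of that
pipeline, using the direct stream from the spark-streaming-kafka connector;
the broker list, topic name, and the "ERROR" filter are placeholders for
whatever the complex use cases actually need:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("log-analytics")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "kafka1:9092,kafka2:9092") // placeholder brokers
val topics = Set("processed-logs")                                         // placeholder topic

// Each record is (key, value); the value is a log line already filtered by Logstash
val logs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).map(_._2)

logs.filter(_.contains("ERROR"))                    // stand-in for the "complex" processing
    .foreachRDD(rdd => rdd.take(10).foreach(println))

ssc.start()
ssc.awaitTermination()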

I was wondering if someone has already done this evaluation and could
provide me some pointers on how/if to create this pipeline with Spark.

Regards,
Ashish


Re: Save GraphX to disk

2015-11-20 Thread Ashish Rawat
Hi Todd,

Could you please provide an example of doing this? Mazerunner seems to be
doing something similar with Neo4j, but it goes via HDFS and updates only
the graph properties. Is there a direct way to do this with Neo4j or Titan?

Regards,
Ashish

From: SLiZn Liu
Date: Saturday, 14 November 2015 7:44 am
To: Gaurav Kumar, "user@spark.apache.org"
Subject: Re: Save GraphX to disk

Hi Gaurav,

Your graph can be saved to graph databases like Neo4j or Titan through
their drivers, which eventually save it to disk.

BR,
Todd

Gaurav Kumar <gauravkuma...@gmail.com> wrote on Friday, 13 November 2015 at
22:08:
Hi,

I was wondering how to save a graph to disk and load it back again. I know
how to save vertices and edges to disk and construct the graph from them,
but I am not sure whether there is any method to save the graph itself to
disk.
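
For reference, a minimal sketch of the save-the-parts-and-rebuild approach
mentioned above, using object files; the paths and the String/Int property
types are illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object GraphPersistenceSketch {
  // Persist the two RDDs that fully define the graph
  def saveGraph(graph: Graph[String, Int], path: String): Unit = {
    graph.vertices.saveAsObjectFile(path + "/vertices")
    graph.edges.saveAsObjectFile(path + "/edges")
  }

  // Rebuild the graph from the saved vertex and edge RDDs
  def loadGraph(sc: SparkContext, path: String): Graph[String, Int] = {
    val vertices = sc.objectFile[(VertexId, String)](path + "/vertices")
    val edges = sc.objectFile[Edge[Int]](path + "/edges")
    Graph(vertices, edges)
  }
}

A graph database such as Neo4j or Titan, as Todd suggests, is the more direct
route if the graph also needs to be queried outside Spark.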

Best Regards,
Gaurav Kumar
Big Data * Data Science * Photography * Music
+91 9953294125


Spark Application Hung

2015-03-24 Thread Ashish Rawat
Hi,

We are observing a hung Spark application when one of the YARN datanodes
(running multiple Spark executors) goes down.

Setup details:

  *   Spark: 1.2.1
  *   Hadoop: 2.4.0
  *   Spark Application Mode: yarn-client
  *   2 datanodes (DN1, DN2)
  *   6 Spark executors (initially 3 executors each on DN1 and DN2; after
rebooting DN2, this changes to 4 executors on DN1 and 2 executors on DN2)

Scenario:

When one of the datanodes (DN2) is brought down, the application hangs,
with the Spark driver continuously showing the following warning:

15/03/24 12:39:26 WARN TaskSetManager: Lost task 5.0 in stage 232.0 (TID 37941, 
DN1): FetchFailed(null, shuffleId=155, mapId=-1, reduceId=5, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 155
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
at 
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)


When DN2 is brought down, one executor gets launched on DN1. When DN2 is
brought back up after 15 minutes, 2 executors get launched on it.
All the executors (including the ones launched after DN2 comes back) keep
showing the following errors:

15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for 
shuffle 155, fetching them
15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Don't have map outputs for 
shuffle 155, fetching them
15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
actor = Actor[akka.tcp://sparkDriver@NN1:44353/user/MapOutputTracker#-957394722]
15/03/24 12:43:30 INFO spark.MapOutputTrackerWorker: Got the output locations
15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for 
shuffle 155
15/03/24 12:43:30 ERROR spark.MapOutputTracker: Missing an output location for 
shuffle 155
15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
44623
15/03/24 12:43:30 INFO executor.Executor: Running task 5.0 in stage 232.960 
(TID 44623)
15/03/24 12:43:30 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
44629