Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-07-01 Thread Koert Kuipers
valid functions can be written for reduce and merge when the zero is null,
so not being able to provide null as the initial value is troublesome.

i guess the proper way to do this is to use Option, and have None be the
zero, which is what i assumed you did?
unfortunately, last time i tried using scala Options with spark Aggregators
it didn't work very well. see:
https://issues.apache.org/jira/browse/SPARK-15810

lifting a semigroup into a monoid like this using Option is fairly typical,
so either null or None has to work or else this api will be somewhat
unpleasant to use for anything practical.

for an example of this lifting on a related Aggregator class:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420

it would be nice to provide a similar convenience method for spark's
Aggregator: basically, if the user provides no zero, the output is
Option[OUT] instead of OUT, which spark translates into OUT being nullable.
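
for illustration, a rough sketch of this lifting in plain scala (made-up
function names, not an actual spark api): given reduce and merge for the
semigroup, the Option-lifted versions use None as the zero and build the
buffer from the first input.

// lift a semigroup (reduce, merge) into a monoid by wrapping the buffer in Option;
// None plays the role of the missing zero.
def liftZero[B]: Option[B] = None

def liftReduce[A, B](init: A => B, reduce: (B, A) => B)(buf: Option[B], a: A): Option[B] =
  buf match {
    case None    => Some(init(a))       // first input seen: build the buffer from it
    case Some(b) => Some(reduce(b, a))
  }

def liftMerge[B](merge: (B, B) => B)(b1: Option[B], b2: Option[B]): Option[B] =
  (b1, b2) match {
    case (Some(x), Some(y)) => Some(merge(x, y))
    case (x, None)          => x
    case (None, y)          => y
  }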


On Fri, Jul 1, 2016 at 5:04 PM, Amit Sela  wrote:

> Thanks for pointing that Koert!
>
> I understand now why zero() and not init(a: IN), though I still don't see
> a good reason to skip the aggregation if zero returns null.
> If the user did it, it's on him to take care of null cases in
> reduce/merge, but it opens-up the possibility to use the input to create
> the buffer for the aggregator.
> Wouldn't that at least enable the functionality discussed in SPARK-15598 ?
> without changing how the Aggregator works.
>
> I bypassed it by using Optional (Guava) because I'm using the Java API,
> but it's a bit cumbersome...
>
> Thanks,
> Amit
>
> On Thu, Jun 30, 2016 at 1:54 AM Koert Kuipers  wrote:
>
>> its the difference between a semigroup and a monoid, and yes max does not
>> easily fit into a monoid.
>>
>> see also discussion here:
>> https://issues.apache.org/jira/browse/SPARK-15598
>>
>> On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela  wrote:
>>
>>> OK. I see that, but the current (provided) implementations are very
>>> naive - Sum, Count, Average -let's take Max for example: I guess zero()
>>> would be set to some value like Long.MIN_VALUE, but what if you trigger (I
>>> assume in the future Spark streaming will support time-based triggers) for
>>> a result and there are no events ?
>>>
>>> And like I said, for a more general use case: What if my zero() function
>>> depends on my input ?
>>>
>>> I just don't see the benefit of this behaviour, though I realise this is
>>> the implementation.
>>>
>>> Thanks,
>>> Amit
>>>
>>> On Sun, Jun 26, 2016 at 2:09 PM Takeshi Yamamuro 
>>> wrote:
>>>
 No, TypedAggregateExpression that uses Aggregator#zero is different
 between v2.0 and v1.6.
 v2.0:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L91
 v1.6:
 https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L115

 // maropu


 On Sun, Jun 26, 2016 at 8:03 PM, Amit Sela 
 wrote:

> This "if (value == null)" condition you point to exists in 1.6 branch
> as well, so that's probably not the reason.
>
> On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
>
>> Whatever it is, this is expected; if an initial value is null, spark
>> codegen removes all the aggregates.
>> See:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199
>>
>> // maropu
>>
>> On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela 
>> wrote:
>>
>>> Not sure about what's the rule in case of `b + null = null` but the
>>> same code works perfectly in 1.6.1, just tried it..
>>>
>>> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>>
 Hi,

 This behaviour seems to be expected because you must ensure `b +
 zero() = b`.
 In your case, `b + null = null` breaks this rule.
 This is the same in v1.6.1.
 See:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57

 // maropu


 On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela 
 wrote:

> Sometimes, the BUF for the aggregator may depend on the actual
> input.. and while this passes the responsibility to handle null in
> merge/reduce to the developer, it sounds fine to me if he is the one 
> who
> put null in zero() anyway.
> Now, it seems that the aggregation is skipped entirely when
> 

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
I found the JIRA for the issue. Will there be a fix in the future, or no fix?

https://issues.apache.org/jira/browse/SPARK-6221



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27267.html



Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Hi Neelesh,

As I said in my earlier emails, it's not a Spark Scala application; I am working
with Spark SQL only.

I am launching the spark-sql shell and running my Hive code inside that shell.

The spark-sql shell accepts functions that relate to Spark SQL; it doesn't accept
functions like coalesce(), which is a Spark Scala function.

What I am trying to do is below:

FROM (SELECT * FROM source_table WHERE load_date = "2016-09-23") a
INSERT OVERWRITE TABLE target_table SELECT *


Thanks
Sri

Sent from my iPhone

> On 1 Jul 2016, at 17:35, nsalian [via Apache Spark User List] 
>  wrote:
> 
> Hi Sri, 
> 
> Thanks for the question. 
> You can simply start by doing this in the initial stage: 
> 
> val sqlContext = new SQLContext(sc) 
> val customerList = sqlContext.read.json(args(0)).coalesce(20) //using a json 
> example here 
> 
> where the argument is the path to the file(s). This will reduce the 
> partitions. 
> You can proceed with repartitioning the data further on. The goal would be to 
> reduce the number of files in the end as you do a saveAsParquet. 
> 
> Hope that helps.
> Neelesh S. Salian 
> Cloudera
> 
> 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27266.html

Re: spark parquet too many small files ?

2016-07-01 Thread nsalian
Hi Sri,

Thanks for the question.
You can simply start by doing this in the initial stage:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val customerList = sqlContext.read.json(args(0)).coalesce(20) // using a JSON example here

where the argument is the path to the file(s). This will reduce the
partitions.
You can proceed with repartitioning the data further on. The goal would be
to reduce the number of files in the end as you do a saveAsParquet.
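
For completeness, a minimal sketch of that final write step (the output path is a
placeholder; in recent APIs saveAsParquet is expressed as write.parquet):

// One part file is written per partition, so this yields roughly 20 files.
customerList.write.parquet("hdfs:///tmp/customers_parquet")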

Hope that helps.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27265.html



spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Hi All, 

I am running Hive SQL in spark-sql in yarn-client mode; the SQL is pretty simple:
load dynamic partitions into a target Parquet table.

I used Hive configuration parameters such as (set
hive.merge.smallfiles.avgsize=25600; set
hive.merge.size.per.task=256000;), which usually merge small files up to a
256 MB block size. Are these parameters supported in spark-sql, or is there
another way to merge a number of small Parquet files into larger ones?

If it were a Scala application I could use the coalesce() or repartition()
functions, but here we are not using a Spark Scala application; it's just plain
spark-sql.


Thanks
Sri 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264.html



Enforcing shuffle hash join

2016-07-01 Thread Lalitha MV
Hi,

In order to force broadcast hash join, we can set
the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce
shuffle hash join in spark sql?
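
For reference, a minimal sketch of the broadcast knob mentioned above (assuming a
SQLContext named sqlContext; the byte value is illustrative):

// Broadcast hash join is chosen when one side is estimated below this threshold (in bytes).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
// Setting it to -1 disables broadcast joins, so the join falls back to a shuffle-based strategy.
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")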


Thanks,
Lalitha


Re: Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread Charles Allen
I'm having the same difficulty porting
https://github.com/metamx/druid-spark-batch/tree/spark2 over to spark2.x,
where I have to go track down who is pulling in bad jackson versions.



On Fri, Jul 1, 2016 at 11:59 AM Sean Owen  wrote:

> Are you just asking why you can't use 2.5.3 in your app? because
> Jackson isn't shaded, which is sort of the bad news. But just use
> 2.6.5 too, ideally. I don't know where 2.6.1 is coming from, but Spark
> doesn't use it.
>
> On Fri, Jul 1, 2016 at 5:48 PM,   wrote:
> > In my project I found the library which brings Jackson core 2.6.5 and it
> is
> > used in conjunction with the requested Jackson scala module 2.5.3 wanted
> by
> > spark 2.0.0 preview. At runtime it's the cause of exception.
> >
> > Now I have excluded 2.6.5 using sbt but it could be dangerous for the
> other
> > library.
> >
> > Why this regression to Jackson 2.5.3 switching from 2.0.0 snapshot to
> > preview ?
> >
> > Thanks
> > Paolo
> >
> > Get Outlook for Android
> >
> >
> >
> >
> > On Fri, Jul 1, 2016 at 4:24 PM +0200, "Paolo Patierno" <
> ppatie...@live.com>
> > wrote:
> >
> > Hi,
> >
> > developing a custom receiver up today I used spark version
> "2.0.0-SNAPSHOT"
> > and scala version 2.11.7.
> > With these version all tests work fine.
> >
> > I have just switching to "2.0.0-preview" as spark version but not I have
> > following error :
> >
> > An exception or error caused a run to abort: class
> > com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides
> > final method
> >
> withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
> > java.lang.VerifyError: class
> > com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides
> > final method
> >
> withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
> > at java.lang.ClassLoader.defineClass1(Native Method)
> >
> > I see that using 2.0.0-SNAPSHOT the jackson core version is 2.6.5 ... now
> > 2.6.1 with module scala 2.5.3.
> >
> > Thanks,
> > Paolo.
> >
> > Paolo Patierno
> > Senior Software Engineer (IoT) @ Red Hat
> > Microsoft MVP on Windows Embedded & IoT
> > Microsoft Azure Advisor
> >
> > Twitter : @ppatierno
> > Linkedin : paolopatierno
> > Blog : DevExperience
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Best way to merge final output part files created by Spark job

2016-07-01 Thread kali.tumm...@gmail.com
Try using the coalesce() function to repartition to the desired number of partition
files. To merge files that have already been written, use Hive and INSERT OVERWRITE
TABLE with the options below.

set hive.merge.smallfiles.avgsize=256;
set hive.merge.size.per.task=256;
set 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681p27263.html



Re: output part files max size

2016-07-01 Thread kali.tumm...@gmail.com
I am not sure, but you can use the coalesce() function to reduce the number of
output files.
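
For illustration, a minimal sketch of that approach (assuming an existing DataFrame
df; the partition count and output path are placeholders):

// Fewer partitions means fewer output part files; coalesce avoids a full shuffle.
df.coalesce(10).write.parquet("/tmp/output")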

Thanks
Sri



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/output-part-files-max-size-tp17013p27262.html



Re: Spark driver assigning splits to incorrect workers

2016-07-01 Thread Ted Yu
I guess you extended some InputFormat for providing locality information.

Can you share some code snippet ?

Which non-distributed file system are you using ?

Thanks

On Fri, Jul 1, 2016 at 2:46 PM, Raajen  wrote:

> I would like to use Spark on a non-distributed file system but am having
> trouble getting the driver to assign tasks to the workers that are local to
> the files. I have extended InputSplits to create my own version of
> FileSplits, so that each worker gets a bit more information than the
> default
> FileSplit provides. I thought that the driver would assign splits based on
> their locality. But I have found that the driver will send these splits to
> workers seemingly at random -- even the very first split will go to a node
> with a different IP than the split specifies. I can see that I am providing
> the right node address via GetLocations. I also set spark.locality.wait to
> a
> high value, but the same misassignment keeps happening.
>
> I am using newAPIHadoopFile to create my RDD. InputFormat is creating the
> required splits, but not all splits refer to the same file or the same
> worker IP.
>
> What else I can check, or change, to force the driver to send these tasks
> to
> the right workers?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-driver-assigning-splits-to-incorrect-workers-tp27261.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark driver assigning splits to incorrect workers

2016-07-01 Thread Raajen
I would like to use Spark on a non-distributed file system but am having
trouble getting the driver to assign tasks to the workers that are local to
the files. I have extended InputSplits to create my own version of
FileSplits, so that each worker gets a bit more information than the default
FileSplit provides. I thought that the driver would assign splits based on
their locality. But I have found that the driver will send these splits to
workers seemingly at random -- even the very first split will go to a node
with a different IP than the split specifies. I can see that I am providing
the right node address via GetLocations. I also set spark.locality.wait to a
high value, but the same misassignment keeps happening.

I am using newAPIHadoopFile to create my RDD. InputFormat is creating the
required splits, but not all splits refer to the same file or the same
worker IP. 

What else I can check, or change, to force the driver to send these tasks to
the right workers?
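
For reference, a small diagnostic sketch of one way to double-check this from the
driver (assuming the RDD has already been created as rdd): it prints the preferred
locations the scheduler actually sees per partition, and these strings need to line
up with the host names the workers registered with.

rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}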

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-driver-assigning-splits-to-incorrect-workers-tp27261.html



Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-07-01 Thread Amit Sela
Thanks for pointing that out, Koert!

I understand now why it's zero() and not init(a: IN), though I still don't see a
good reason to skip the aggregation if zero() returns null.
If the user did it, it's on him to take care of null cases in reduce/merge,
but it opens up the possibility of using the input to create the buffer for
the aggregator.
Wouldn't that at least enable the functionality discussed in SPARK-15598,
without changing how the Aggregator works?

I bypassed it by using Optional (Guava) because I'm using the Java API, but
it's a bit cumbersome...

Thanks,
Amit

On Thu, Jun 30, 2016 at 1:54 AM Koert Kuipers  wrote:

> its the difference between a semigroup and a monoid, and yes max does not
> easily fit into a monoid.
>
> see also discussion here:
> https://issues.apache.org/jira/browse/SPARK-15598
>
> On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela  wrote:
>
>> OK. I see that, but the current (provided) implementations are very naive
>> - Sum, Count, Average -let's take Max for example: I guess zero() would be
>> set to some value like Long.MIN_VALUE, but what if you trigger (I assume in
>> the future Spark streaming will support time-based triggers) for a result
>> and there are no events ?
>>
>> And like I said, for a more general use case: What if my zero() function
>> depends on my input ?
>>
>> I just don't see the benefit of this behaviour, though I realise this is
>> the implementation.
>>
>> Thanks,
>> Amit
>>
>> On Sun, Jun 26, 2016 at 2:09 PM Takeshi Yamamuro 
>> wrote:
>>
>>> No, TypedAggregateExpression that uses Aggregator#zero is different
>>> between v2.0 and v1.6.
>>> v2.0:
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L91
>>> v1.6:
>>> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L115
>>>
>>> // maropu
>>>
>>>
>>> On Sun, Jun 26, 2016 at 8:03 PM, Amit Sela  wrote:
>>>
 This "if (value == null)" condition you point to exists in 1.6 branch
 as well, so that's probably not the reason.

 On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro 
 wrote:

> Whatever it is, this is expected; if an initial value is null, spark
> codegen removes all the aggregates.
> See:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199
>
> // maropu
>
> On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela 
> wrote:
>
>> Not sure about what's the rule in case of `b + null = null` but the
>> same code works perfectly in 1.6.1, just tried it..
>>
>> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This behaviour seems to be expected because you must ensure `b +
>>> zero() = b`.
>>> In your case, `b + null = null` breaks this rule.
>>> This is the same in v1.6.1.
>>> See:
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
>>>
>>> // maropu
>>>
>>>
>>> On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela 
>>> wrote:
>>>
 Sometimes, the BUF for the aggregator may depend on the actual
 input.. and while this passes the responsibility to handle null in
 merge/reduce to the developer, it sounds fine to me if he is the one 
 who
 put null in zero() anyway.
 Now, it seems that the aggregation is skipped entirely when zero()
 = null. Not sure if that was the behaviour in 1.6

 Is this behaviour wanted ?

 Thanks,
 Amit

 Aggregator example:

 public static class Agg extends Aggregator<Tuple2<String, Integer>, Integer, Integer> {
   // (key type of the Tuple2 assumed to be String; only the Integer value is used below)

   @Override
   public Integer zero() {
     return null;
   }

   @Override
   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
     if (b == null) {
       b = 0;
     }
     return b + a._2();
   }

   @Override
   public Integer merge(Integer b1, Integer b2) {
     if (b1 == null) {
       return b2;
     } else if (b2 == null) {
       return b1;
     } else {
       return b1 + b2;
     }
   }


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
What about yarn-cluster mode?

2016-07-01 11:24 GMT-07:00 Egor Pahomov :

> Separate bad users with bad quires from good users with good quires. Spark
> do not provide no scope separation out of the box.
>
> 2016-07-01 11:12 GMT-07:00 Jeff Zhang :
>
>> I think so, any reason you want to deploy multiple thrift server on one
>> machine ?
>>
>> On Fri, Jul 1, 2016 at 10:59 AM, Egor Pahomov 
>> wrote:
>>
>>> Takeshi, of course I used different HIVE_SERVER2_THRIFT_PORT
>>> Jeff, thanks, I would try, but from your answer I'm getting the feeling,
>>> that I'm trying some very rare case?
>>>
>>> 2016-07-01 10:54 GMT-07:00 Jeff Zhang :
>>>
 This is not a bug, because these 2 processes use the same SPARK_PID_DIR
 which is /tmp by default.  Although you can resolve this by using
 different SPARK_PID_DIR, I suspect you would still have other issues like
 port conflict. I would suggest you to deploy one spark thrift server per
 machine for now. If stick to deploy multiple spark thrift server on one
 machine, then define different SPARK_CONF_DIR, SPARK_LOG_DIR and
 SPARK_PID_DIR for your 2 instances of spark thrift server. Not sure if
 there's other conflicts. but please try first.


 On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov 
 wrote:

> I get
>
> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
> process 28989.  Stop it first."
>
> Is it a bug?
>
> 2016-07-01 10:10 GMT-07:00 Jeff Zhang :
>
>> I don't think the one instance per machine is true.  As long as you
>> resolve the conflict issue such as port conflict, pid file, log file and
>> etc, you can run multiple instances of spark thrift server.
>>
>> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
>> wrote:
>>
>>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>>> bother me -
>>>
>>> 1) One instance per machine
>>> 2) Yarn client only(not yarn cluster)
>>>
>>> Are there any architectural reasons for such limitations? About
>>> yarn-client I might understand in theory - master is the same process 
>>> as a
>>> server, so it makes some sense, but it's really inconvenient - I need a 
>>> lot
>>> of memory on my driver machine. Reasons for one instance per machine I 
>>> do
>>> not understand.
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 


*Sincerely yoursEgor Pakhomov*


Re: Deploying ML Pipeline Model

2016-07-01 Thread Saurabh Sardeshpande
Hi Nick,

Thanks for the answer. Do you think an implementation like the one in this
article is infeasible in production for say, hundreds of queries per
minute?
https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
The article uses Flask to define routes and Spark for evaluating requests.

Regards,
Saurabh






On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath 
wrote:

> Generally there are 2 ways to use a trained pipeline model - (offline)
> batch scoring, and real-time online scoring.
>
> For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
> certainly loading the model back in Spark and feeding new data through the
> pipeline for prediction works just fine, and this is essentially what is
> supported in 1.6 (and more or less full coverage in 2.0). For large batch
> cases this can be quite efficient.
>
> However, usually for real-time use cases, the latency required is fairly
> low - of the order of a few ms to a few 100ms for a request (some examples
> include recommendations, ad-serving, fraud detection etc).
>
> In these cases, using Spark has 2 issues: (1) latency for prediction on
> the pipeline, which is based on DataFrames and therefore distributed
> execution, is usually fairly high "per request"; (2) this requires pulling
> in all of Spark for your real-time serving layer (or running a full Spark
> cluster), which is usually way too much overkill - all you really need for
> serving is a bit of linear algebra and some basic transformations.
>
> So for now, unfortunately there is not much in the way of options for
> exporting your pipelines and serving them outside of Spark - the
> JPMML-based project mentioned on this thread is one option. The other
> option at this point is to write your own export functionality and your own
> serving layer.
>
> There is (very initial) movement towards improving the local serving
> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
> was the "first step" in this process).
>
> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski  wrote:
>
>> Hi Rishabh,
>>
>> I've just today had similar conversation about how to do a ML Pipeline
>> deployment and couldn't really answer this question and more because I
>> don't really understand the use case.
>>
>> What would you expect from ML Pipeline model deployment? You can save
>> your model to a file by model.write.overwrite.save("model_v1").
>>
>> model_v1
>> |-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- stages
>> |-- 0_regexTok_b4265099cc1c
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> |-- 1_hashingTF_8de997cf54ba
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- 2_linReg_3942a71d2c0e
>> |-- data
>> |   |-- _SUCCESS
>> |   |-- _common_metadata
>> |   |-- _metadata
>> |   `--
>> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>> `-- metadata
>> |-- _SUCCESS
>> `-- part-0
>>
>> 9 directories, 12 files
>>
>> What would you like to have outside SparkContext? What's wrong with
>> using Spark? Just curious hoping to understand the use case better.
>> Thanks.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj 
>> wrote:
>> > Hi All,
>> >
>> > I am looking for ways to deploy a ML Pipeline model in production .
>> > Spark has already proved to be a one of the best framework for model
>> > training and creation, but once the ml pipeline model is ready how can I
>> > deploy it outside spark context ?
>> > MLlib model has toPMML method but today Pipeline model can not be saved
>> to
>> > PMML. There are some frameworks like MLeap which are trying to abstract
>> > Pipeline Model and provide ML Pipeline Model deployment outside spark
>> > context,but currently they don't have most of the ml transformers and
>> > estimators.
>> > I am looking for related work going on this area.
>> > Any pointers will be helpful.
>> >
>> > Thanks,
>> > Rishabh.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Deploying ML Pipeline Model

2016-07-01 Thread Sean Owen
(The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use
JPMML in Spark and couldn't otherwise because the Affero license is
not Apache compatible.)

On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath  wrote:
> I believe open-scoring is one of the well-known PMML serving frameworks in
> Java land (https://github.com/jpmml/openscoring). One can also use the raw
> https://github.com/jpmml/jpmml-evaluator for embedding in apps.
>
> (Note the license on both of these is AGPL - the older version of JPMML used
> to be Apache2 if I recall correctly).
>




Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
I believe open-scoring is one of the well-known PMML serving frameworks in
Java land (https://github.com/jpmml/openscoring). One can also use the raw
https://github.com/jpmml/jpmml-evaluator for embedding in apps.

(Note the license on both of these is AGPL - the older version of JPMML
used to be Apache2 if I recall correctly).

On Fri, 1 Jul 2016 at 20:15 Jacek Laskowski  wrote:

> Hi Nick,
>
> Thanks a lot for the exhaustive and prompt response! (In the meantime
> I watched a video about PMML to get a better understanding of the
> topic).
>
> What are the tools that could "consume" PMML exports (after running
> JPMML)? What tools would be the endpoint to deliver low-latency
> predictions by doing this "a bit of linear algebra and some basic
> transformations"?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 1, 2016 at 6:47 PM, Nick Pentreath 
> wrote:
> > Generally there are 2 ways to use a trained pipeline model - (offline)
> batch
> > scoring, and real-time online scoring.
> >
> > For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
> > certainly loading the model back in Spark and feeding new data through
> the
> > pipeline for prediction works just fine, and this is essentially what is
> > supported in 1.6 (and more or less full coverage in 2.0). For large batch
> > cases this can be quite efficient.
> >
> > However, usually for real-time use cases, the latency required is fairly
> low
> > - of the order of a few ms to a few 100ms for a request (some examples
> > include recommendations, ad-serving, fraud detection etc).
> >
> > In these cases, using Spark has 2 issues: (1) latency for prediction on
> the
> > pipeline, which is based on DataFrames and therefore distributed
> execution,
> > is usually fairly high "per request"; (2) this requires pulling in all of
> > Spark for your real-time serving layer (or running a full Spark cluster),
> > which is usually way too much overkill - all you really need for serving
> is
> > a bit of linear algebra and some basic transformations.
> >
> > So for now, unfortunately there is not much in the way of options for
> > exporting your pipelines and serving them outside of Spark - the
> JPMML-based
> > project mentioned on this thread is one option. The other option at this
> > point is to write your own export functionality and your own serving
> layer.
> >
> > There is (very initial) movement towards improving the local serving
> > possibilities (see https://issues.apache.org/jira/browse/SPARK-13944
> which
> > was the "first step" in this process).
> >
> > On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski  wrote:
> >>
> >> Hi Rishabh,
> >>
> >> I've just today had similar conversation about how to do a ML Pipeline
> >> deployment and couldn't really answer this question and more because I
> >> don't really understand the use case.
> >>
> >> What would you expect from ML Pipeline model deployment? You can save
> >> your model to a file by model.write.overwrite.save("model_v1").
> >>
> >> model_v1
> >> |-- metadata
> >> |   |-- _SUCCESS
> >> |   `-- part-0
> >> `-- stages
> >> |-- 0_regexTok_b4265099cc1c
> >> |   `-- metadata
> >> |   |-- _SUCCESS
> >> |   `-- part-0
> >> |-- 1_hashingTF_8de997cf54ba
> >> |   `-- metadata
> >> |   |-- _SUCCESS
> >> |   `-- part-0
> >> `-- 2_linReg_3942a71d2c0e
> >> |-- data
> >> |   |-- _SUCCESS
> >> |   |-- _common_metadata
> >> |   |-- _metadata
> >> |   `--
> >> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
> >> `-- metadata
> >> |-- _SUCCESS
> >> `-- part-0
> >>
> >> 9 directories, 12 files
> >>
> >> What would you like to have outside SparkContext? What's wrong with
> >> using Spark? Just curious hoping to understand the use case better.
> >> Thanks.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj 
> >> wrote:
> >> > Hi All,
> >> >
> >> > I am looking for ways to deploy a ML Pipeline model in production .
> >> > Spark has already proved to be a one of the best framework for model
> >> > training and creation, but once the ml pipeline model is ready how
> can I
> >> > deploy it outside spark context ?
> >> > MLlib model has toPMML method but today Pipeline model can not be
> saved
> >> > to
> >> > PMML. There are some frameworks like MLeap which are trying to
> abstract
> >> > Pipeline Model and provide ML Pipeline Model deployment outside spark
> >> > context,but currently they don't have most of 

Re: Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread Sean Owen
Are you just asking why you can't use 2.5.3 in your app? because
Jackson isn't shaded, which is sort of the bad news. But just use
2.6.5 too, ideally. I don't know where 2.6.1 is coming from, but Spark
doesn't use it.
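
For reference, one way to follow that advice in an sbt build (a rough sketch; which
Jackson modules actually need pinning depends on your dependency tree):

// build.sbt: pin the clashing Jackson artifacts to one consistent version (2.6.5, per the advice above).
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.6.5",
  "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.6.5",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.5"
)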

On Fri, Jul 1, 2016 at 5:48 PM,   wrote:
> In my project I found the library which brings Jackson core 2.6.5 and it is
> used in conjunction with the requested Jackson scala module 2.5.3 wanted by
> spark 2.0.0 preview. At runtime it's the cause of exception.
>
> Now I have excluded 2.6.5 using sbt but it could be dangerous for the other
> library.
>
> Why this regression to Jackson 2.5.3 switching from 2.0.0 snapshot to
> preview ?
>
> Thanks
> Paolo
>
> Get Outlook for Android
>
>
>
>
> On Fri, Jul 1, 2016 at 4:24 PM +0200, "Paolo Patierno" 
> wrote:
>
> Hi,
>
> developing a custom receiver up today I used spark version "2.0.0-SNAPSHOT"
> and scala version 2.11.7.
> With these version all tests work fine.
>
> I have just switching to "2.0.0-preview" as spark version but not I have
> following error :
>
> An exception or error caused a run to abort: class
> com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides
> final method
> withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
> java.lang.VerifyError: class
> com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides
> final method
> withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
> at java.lang.ClassLoader.defineClass1(Native Method)
>
> I see that using 2.0.0-SNAPSHOT the jackson core version is 2.6.5 ... now
> 2.6.1 with module scala 2.5.3.
>
> Thanks,
> Paolo.
>
> Paolo Patierno
> Senior Software Engineer (IoT) @ Red Hat
> Microsoft MVP on Windows Embedded & IoT
> Microsoft Azure Advisor
>
> Twitter : @ppatierno
> Linkedin : paolopatierno
> Blog : DevExperience




Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
Separate bad users with bad queries from good users with good queries. Spark
does not provide any scope separation out of the box.

2016-07-01 11:12 GMT-07:00 Jeff Zhang :

> I think so, any reason you want to deploy multiple thrift server on one
> machine ?
>
> On Fri, Jul 1, 2016 at 10:59 AM, Egor Pahomov 
> wrote:
>
>> Takeshi, of course I used different HIVE_SERVER2_THRIFT_PORT
>> Jeff, thanks, I would try, but from your answer I'm getting the feeling,
>> that I'm trying some very rare case?
>>
>> 2016-07-01 10:54 GMT-07:00 Jeff Zhang :
>>
>>> This is not a bug, because these 2 processes use the same SPARK_PID_DIR
>>> which is /tmp by default.  Although you can resolve this by using
>>> different SPARK_PID_DIR, I suspect you would still have other issues like
>>> port conflict. I would suggest you to deploy one spark thrift server per
>>> machine for now. If stick to deploy multiple spark thrift server on one
>>> machine, then define different SPARK_CONF_DIR, SPARK_LOG_DIR and
>>> SPARK_PID_DIR for your 2 instances of spark thrift server. Not sure if
>>> there's other conflicts. but please try first.
>>>
>>>
>>> On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov 
>>> wrote:
>>>
 I get

 "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
 process 28989.  Stop it first."

 Is it a bug?

 2016-07-01 10:10 GMT-07:00 Jeff Zhang :

> I don't think the one instance per machine is true.  As long as you
> resolve the conflict issue such as port conflict, pid file, log file and
> etc, you can run multiple instances of spark thrift server.
>
> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
> wrote:
>
>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>> bother me -
>>
>> 1) One instance per machine
>> 2) Yarn client only(not yarn cluster)
>>
>> Are there any architectural reasons for such limitations? About
>> yarn-client I might understand in theory - master is the same process as 
>> a
>> server, so it makes some sense, but it's really inconvenient - I need a 
>> lot
>> of memory on my driver machine. Reasons for one instance per machine I do
>> not understand.
>>
>> --
>>
>>
>> *Sincerely yoursEgor Pakhomov*
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



 --


 *Sincerely yoursEgor Pakhomov*

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>>
>> *Sincerely yoursEgor Pakhomov*
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 


*Sincerely yoursEgor Pakhomov*


Re: Deploying ML Pipeline Model

2016-07-01 Thread Jacek Laskowski
Hi Nick,

Thanks a lot for the exhaustive and prompt response! (In the meantime
I watched a video about PMML to get a better understanding of the
topic).

What are the tools that could "consume" PMML exports (after running
JPMML)? What tools would be the endpoint to deliver low-latency
predictions by doing this "a bit of linear algebra and some basic
transformations"?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 1, 2016 at 6:47 PM, Nick Pentreath  wrote:
> Generally there are 2 ways to use a trained pipeline model - (offline) batch
> scoring, and real-time online scoring.
>
> For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
> certainly loading the model back in Spark and feeding new data through the
> pipeline for prediction works just fine, and this is essentially what is
> supported in 1.6 (and more or less full coverage in 2.0). For large batch
> cases this can be quite efficient.
>
> However, usually for real-time use cases, the latency required is fairly low
> - of the order of a few ms to a few 100ms for a request (some examples
> include recommendations, ad-serving, fraud detection etc).
>
> In these cases, using Spark has 2 issues: (1) latency for prediction on the
> pipeline, which is based on DataFrames and therefore distributed execution,
> is usually fairly high "per request"; (2) this requires pulling in all of
> Spark for your real-time serving layer (or running a full Spark cluster),
> which is usually way too much overkill - all you really need for serving is
> a bit of linear algebra and some basic transformations.
>
> So for now, unfortunately there is not much in the way of options for
> exporting your pipelines and serving them outside of Spark - the JPMML-based
> project mentioned on this thread is one option. The other option at this
> point is to write your own export functionality and your own serving layer.
>
> There is (very initial) movement towards improving the local serving
> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
> was the "first step" in this process).
>
> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski  wrote:
>>
>> Hi Rishabh,
>>
>> I've just today had similar conversation about how to do a ML Pipeline
>> deployment and couldn't really answer this question and more because I
>> don't really understand the use case.
>>
>> What would you expect from ML Pipeline model deployment? You can save
>> your model to a file by model.write.overwrite.save("model_v1").
>>
>> model_v1
>> |-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- stages
>> |-- 0_regexTok_b4265099cc1c
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> |-- 1_hashingTF_8de997cf54ba
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- 2_linReg_3942a71d2c0e
>> |-- data
>> |   |-- _SUCCESS
>> |   |-- _common_metadata
>> |   |-- _metadata
>> |   `--
>> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>> `-- metadata
>> |-- _SUCCESS
>> `-- part-0
>>
>> 9 directories, 12 files
>>
>> What would you like to have outside SparkContext? What's wrong with
>> using Spark? Just curious hoping to understand the use case better.
>> Thanks.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj 
>> wrote:
>> > Hi All,
>> >
>> > I am looking for ways to deploy a ML Pipeline model in production .
>> > Spark has already proved to be a one of the best framework for model
>> > training and creation, but once the ml pipeline model is ready how can I
>> > deploy it outside spark context ?
>> > MLlib model has toPMML method but today Pipeline model can not be saved
>> > to
>> > PMML. There are some frameworks like MLeap which are trying to abstract
>> > Pipeline Model and provide ML Pipeline Model deployment outside spark
>> > context,but currently they don't have most of the ml transformers and
>> > estimators.
>> > I am looking for related work going on this area.
>> > Any pointers will be helpful.
>> >
>> > Thanks,
>> > Rishabh.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>




Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
I think so, any reason you want to deploy multiple thrift server on one
machine ?

On Fri, Jul 1, 2016 at 10:59 AM, Egor Pahomov 
wrote:

> Takeshi, of course I used different HIVE_SERVER2_THRIFT_PORT
> Jeff, thanks, I would try, but from your answer I'm getting the feeling,
> that I'm trying some very rare case?
>
> 2016-07-01 10:54 GMT-07:00 Jeff Zhang :
>
>> This is not a bug, because these 2 processes use the same SPARK_PID_DIR
>> which is /tmp by default.  Although you can resolve this by using
>> different SPARK_PID_DIR, I suspect you would still have other issues like
>> port conflict. I would suggest you to deploy one spark thrift server per
>> machine for now. If stick to deploy multiple spark thrift server on one
>> machine, then define different SPARK_CONF_DIR, SPARK_LOG_DIR and
>> SPARK_PID_DIR for your 2 instances of spark thrift server. Not sure if
>> there's other conflicts. but please try first.
>>
>>
>> On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov 
>> wrote:
>>
>>> I get
>>>
>>> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
>>> process 28989.  Stop it first."
>>>
>>> Is it a bug?
>>>
>>> 2016-07-01 10:10 GMT-07:00 Jeff Zhang :
>>>
 I don't think the one instance per machine is true.  As long as you
 resolve the conflict issue such as port conflict, pid file, log file and
 etc, you can run multiple instances of spark thrift server.

 On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
 wrote:

> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
> bother me -
>
> 1) One instance per machine
> 2) Yarn client only(not yarn cluster)
>
> Are there any architectural reasons for such limitations? About
> yarn-client I might understand in theory - master is the same process as a
> server, so it makes some sense, but it's really inconvenient - I need a 
> lot
> of memory on my driver machine. Reasons for one instance per machine I do
> not understand.
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
Best Regards

Jeff Zhang


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
Takeshi, of course I used different HIVE_SERVER2_THRIFT_PORT
Jeff, thanks, I would try, but from your answer I'm getting the feeling,
that I'm trying some very rare case?

2016-07-01 10:54 GMT-07:00 Jeff Zhang :

> This is not a bug, because these 2 processes use the same SPARK_PID_DIR
> which is /tmp by default.  Although you can resolve this by using
> different SPARK_PID_DIR, I suspect you would still have other issues like
> port conflict. I would suggest you to deploy one spark thrift server per
> machine for now. If stick to deploy multiple spark thrift server on one
> machine, then define different SPARK_CONF_DIR, SPARK_LOG_DIR and
> SPARK_PID_DIR for your 2 instances of spark thrift server. Not sure if
> there's other conflicts. but please try first.
>
>
> On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov 
> wrote:
>
>> I get
>>
>> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
>> process 28989.  Stop it first."
>>
>> Is it a bug?
>>
>> 2016-07-01 10:10 GMT-07:00 Jeff Zhang :
>>
>>> I don't think the one instance per machine is true.  As long as you
>>> resolve the conflict issue such as port conflict, pid file, log file and
>>> etc, you can run multiple instances of spark thrift server.
>>>
>>> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
>>> wrote:
>>>
 Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
 bother me -

 1) One instance per machine
 2) Yarn client only(not yarn cluster)

 Are there any architectural reasons for such limitations? About
 yarn-client I might understand in theory - master is the same process as a
 server, so it makes some sense, but it's really inconvenient - I need a lot
 of memory on my driver machine. Reasons for one instance per machine I do
 not understand.

 --


 *Sincerely yoursEgor Pakhomov*

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>>
>> *Sincerely yoursEgor Pakhomov*
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 


*Sincerely yoursEgor Pakhomov*


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
This is not a bug: these 2 processes use the same SPARK_PID_DIR,
which is /tmp by default. Although you can resolve this by using a
different SPARK_PID_DIR, I suspect you would still have other issues like
port conflicts. I would suggest you deploy one spark thrift server per
machine for now. If you stick to deploying multiple spark thrift servers on one
machine, then define a different SPARK_CONF_DIR, SPARK_LOG_DIR and
SPARK_PID_DIR for your 2 instances of the spark thrift server. Not sure if
there are other conflicts, but please try first.


On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov 
wrote:

> I get
>
> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
> process 28989.  Stop it first."
>
> Is it a bug?
>
> 2016-07-01 10:10 GMT-07:00 Jeff Zhang :
>
>> I don't think the one instance per machine is true.  As long as you
>> resolve the conflict issue such as port conflict, pid file, log file and
>> etc, you can run multiple instances of spark thrift server.
>>
>> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
>> wrote:
>>
>>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>>> bother me -
>>>
>>> 1) One instance per machine
>>> 2) Yarn client only(not yarn cluster)
>>>
>>> Are there any architectural reasons for such limitations? About
>>> yarn-client I might understand in theory - master is the same process as a
>>> server, so it makes some sense, but it's really inconvenient - I need a lot
>>> of memory on my driver machine. Reasons for one instance per machine I do
>>> not understand.
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
Best Regards

Jeff Zhang


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Takeshi Yamamuro
As said earlier, how about changing the bound port by setting the env var
`HIVE_SERVER2_THRIFT_PORT`?

// maropu

On Fri, Jul 1, 2016 at 10:47 AM, Egor Pahomov 
wrote:

> I get
>
> "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
> process 28989.  Stop it first."
>
> Is it a bug?
>
> 2016-07-01 10:10 GMT-07:00 Jeff Zhang :
>
>> I don't think the one instance per machine is true.  As long as you
>> resolve the conflict issue such as port conflict, pid file, log file and
>> etc, you can run multiple instances of spark thrift server.
>>
>> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
>> wrote:
>>
>>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>>> bother me -
>>>
>>> 1) One instance per machine
>>> 2) Yarn client only(not yarn cluster)
>>>
>>> Are there any architectural reasons for such limitations? About
>>> yarn-client I might understand in theory - master is the same process as a
>>> server, so it makes some sense, but it's really inconvenient - I need a lot
>>> of memory on my driver machine. Reasons for one instance per machine I do
>>> not understand.
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
---
Takeshi Yamamuro


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
I get

"org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as
process 28989.  Stop it first."

Is it a bug?

2016-07-01 10:10 GMT-07:00 Jeff Zhang :

> I don't think the one instance per machine is true.  As long as you
> resolve the conflict issue such as port conflict, pid file, log file and
> etc, you can run multiple instances of spark thrift server.
>
> On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov 
> wrote:
>
>> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really
>> bother me -
>>
>> 1) One instance per machine
>> 2) Yarn client only(not yarn cluster)
>>
>> Are there any architectural reasons for such limitations? About
>> yarn-client I might understand in theory - master is the same process as a
>> server, so it makes some sense, but it's really inconvenient - I need a lot
>> of memory on my driver machine. Reasons for one instance per machine I do
>> not understand.
>>
>> --
>>
>>
>> *Sincerely yoursEgor Pakhomov*
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 


*Sincerely yoursEgor Pakhomov*


Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
Generally there are 2 ways to use a trained pipeline model - (offline)
batch scoring, and real-time online scoring.

For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
certainly loading the model back in Spark and feeding new data through the
pipeline for prediction works just fine, and this is essentially what is
supported in 1.6 (and more or less full coverage in 2.0). For large batch
cases this can be quite efficient.

However, usually for real-time use cases, the latency required is fairly
low - of the order of a few ms to a few 100ms for a request (some examples
include recommendations, ad-serving, fraud detection etc).

In these cases, using Spark has 2 issues: (1) latency for prediction on the
pipeline, which is based on DataFrames and therefore distributed execution,
is usually fairly high "per request"; (2) this requires pulling in all of
Spark for your real-time serving layer (or running a full Spark cluster),
which is usually way too much overkill - all you really need for serving is
a bit of linear algebra and some basic transformations.

So for now, unfortunately there is not much in the way of options for
exporting your pipelines and serving them outside of Spark - the
JPMML-based project mentioned on this thread is one option. The other
option at this point is to write your own export functionality and your own
serving layer.

There is (very initial) movement towards improving the local serving
possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
was the "first step" in this process).

On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski  wrote:

> Hi Rishabh,
>
> I've just today had similar conversation about how to do a ML Pipeline
> deployment and couldn't really answer this question and more because I
> don't really understand the use case.
>
> What would you expect from ML Pipeline model deployment? You can save
> your model to a file by model.write.overwrite.save("model_v1").
>
> model_v1
> |-- metadata
> |   |-- _SUCCESS
> |   `-- part-0
> `-- stages
> |-- 0_regexTok_b4265099cc1c
> |   `-- metadata
> |   |-- _SUCCESS
> |   `-- part-0
> |-- 1_hashingTF_8de997cf54ba
> |   `-- metadata
> |   |-- _SUCCESS
> |   `-- part-0
> `-- 2_linReg_3942a71d2c0e
> |-- data
> |   |-- _SUCCESS
> |   |-- _common_metadata
> |   |-- _metadata
> |   `--
> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
> `-- metadata
> |-- _SUCCESS
> `-- part-0
>
> 9 directories, 12 files
>
> What would you like to have outside SparkContext? What's wrong with
> using Spark? Just curious hoping to understand the use case better.
> Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj 
> wrote:
> > Hi All,
> >
> > I am looking for ways to deploy a ML Pipeline model in production .
> > Spark has already proved to be a one of the best framework for model
> > training and creation, but once the ml pipeline model is ready how can I
> > deploy it outside spark context ?
> > MLlib model has toPMML method but today Pipeline model can not be saved
> to
> > PMML. There are some frameworks like MLeap which are trying to abstract
> > Pipeline Model and provide ML Pipeline Model deployment outside spark
> > context,but currently they don't have most of the ml transformers and
> > estimators.
> > I am looking for related work going on this area.
> > Any pointers will be helpful.
> >
> > Thanks,
> > Rishabh.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Deploying ML Pipeline Model

2016-07-01 Thread Jacek Laskowski
Hi Rishabh,

I've just today had a similar conversation about how to do an ML Pipeline
deployment and couldn't really answer this question, mostly because I
don't really understand the use case.

What would you expect from ML Pipeline model deployment? You can save
your model to a file by model.write.overwrite.save("model_v1").

model_v1
|-- metadata
|   |-- _SUCCESS
|   `-- part-0
`-- stages
|-- 0_regexTok_b4265099cc1c
|   `-- metadata
|   |-- _SUCCESS
|   `-- part-0
|-- 1_hashingTF_8de997cf54ba
|   `-- metadata
|   |-- _SUCCESS
|   `-- part-0
`-- 2_linReg_3942a71d2c0e
|-- data
|   |-- _SUCCESS
|   |-- _common_metadata
|   |-- _metadata
|   `-- part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
`-- metadata
|-- _SUCCESS
`-- part-0

9 directories, 12 files

What would you like to have outside SparkContext? What's wrong with
using Spark? Just curious hoping to understand the use case better.
Thanks.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj  wrote:
> Hi All,
>
> I am looking for ways to deploy a ML Pipeline model in production .
> Spark has already proved to be a one of the best framework for model
> training and creation, but once the ml pipeline model is ready how can I
> deploy it outside spark context ?
> MLlib model has toPMML method but today Pipeline model can not be saved to
> PMML. There are some frameworks like MLeap which are trying to abstract
> Pipeline Model and provide ML Pipeline Model deployment outside spark
> context,but currently they don't have most of the ml transformers and
> estimators.
> I am looking for related work going on this area.
> Any pointers will be helpful.
>
> Thanks,
> Rishabh.




Re: How are threads created in SQL Executor?

2016-07-01 Thread Takeshi Yamamuro
You mean `spark.sql.shuffle.partitions`?
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
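
(For reference, a minimal sketch of adjusting it, assuming a SQLContext named
sqlContext; 64 is an arbitrary value:)

sqlContext.setConf("spark.sql.shuffle.partitions", "64")
// or, from SQL: SET spark.sql.shuffle.partitions=64;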

// maropu

On Fri, Jul 1, 2016 at 8:42 AM, emiretsk  wrote:

> It seems like threads are created by SQLExecution.withExecutionId, which is
> called inside BroadcastExchangeExec.scala.
> When does the plan executor execute a BroadcastExchange, and is there a way
> to control the number of threads? We have a job that writes DataFrames to
> an
> external DB, and it seems like each task creates as many threads as there
> are cores.
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-are-threads-created-in-SQL-Executor-tp27260.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro


Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Jeff Zhang
I don't think the one-instance-per-machine limitation is true. As long as you
resolve the conflicts (port, PID file, log file, etc.), you can run multiple
instances of the Spark Thrift server.

On Fri, Jul 1, 2016 at 9:32 AM, Egor Pahomov  wrote:

> Hi, I'm using Spark Thrift JDBC server and 2 limitations are really bother
> me -
>
> 1) One instance per machine
> 2) Yarn client only(not yarn cluster)
>
> Are there any architectural reasons for such limitations? About
> yarn-client I might understand in theory - master is the same process as a
> server, so it makes some sense, but it's really inconvenient - I need a lot
> of memory on my driver machine. Reasons for one instance per machine I do
> not understand.
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>



-- 
Best Regards

Jeff Zhang


Re: Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread ppatierno


In my project I found the library that brings in Jackson core 2.6.5; it is
used in conjunction with the Jackson Scala module 2.5.3 requested by
Spark 2.0.0-preview. At runtime this mismatch is the cause of the exception.


Now I have excluded 2.6.5 using sbt, but that could be risky for the other
library.
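
An alternative to the exclusion, since sbt is already in play, is to pin a single Jackson
version across the whole dependency tree; a sketch only, with illustrative version numbers
(pick whatever combination your project actually needs):

// build.sbt (sbt 0.13 syntax); the versions here are an example, not a recommendation
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.6.5",
  "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.6.5",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.5"
)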


Why this regression to Jackson 2.5.3 when switching from the 2.0.0 snapshot to the preview?


Thanks

Paolo








On Fri, Jul 1, 2016 at 4:24 PM +0200, "Paolo Patierno"  
wrote:





Hi,

Developing a custom receiver, up to today I used Spark version "2.0.0-SNAPSHOT" and
Scala version 2.11.7.
With these versions all tests work fine.

I have just switched to "2.0.0-preview" as the Spark version, but now I have the
following error:

An exception or error caused a run to abort: class 
com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final 
method 
withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
java.lang.VerifyError: class 
com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final 
method 
withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
at java.lang.ClassLoader.defineClass1(Native Method)

I see that using 2.0.0-SNAPSHOT the jackson core version is 2.6.5 ... now 2.6.1 
with module scala 2.5.3.

Thanks,
Paolo.

Paolo Patierno
Senior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoT
Microsoft Azure Advisor
Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience


Cluster mode deployment from jar in S3

2016-07-01 Thread Ashic Mahtab
Hello,

I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs
using "--deploy-mode client", however using "--deploy-mode cluster" is proving
to be a challenge. I've tried this:

spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster s3://bucket/dir/foo.jar

When I do this, I get:

16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3 URL, or by
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3 URL, or by
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)

Now I'm not using any S3 or Hadoop stuff within my code (it's just an
sc.parallelize(1 to 100)). So I imagine it's the driver trying to fetch the
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role
the machines are in allows them to copy the jar. In other words, this works:

aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar

I'm using Spark 1.6.2, and can't really think of what I can do so that I can
submit the jar from S3 using cluster deploy mode. I've also tried simply
downloading the jar onto a node and spark-submitting that... that works in
client mode, but I get a not-found error when using cluster mode.

Any help will be appreciated.

Thanks,
Ashic.

Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Egor Pahomov
Hi, I'm using the Spark Thrift JDBC server and two limitations really bother
me:

1) One instance per machine
2) Yarn client only (not yarn cluster)

Are there any architectural reasons for such limitations? The yarn-client one
I might understand in theory: the master is the same process as the server, so
it makes some sense, but it's really inconvenient, since I need a lot of memory
on my driver machine. The reason for one instance per machine I do not
understand.

-- 


*Sincerely yoursEgor Pakhomov*


Re: Random Forest Classification

2016-07-01 Thread Rich Tarro
Hi Bryan.

Thanks for your continued help.

Here is the code shown in a Jupyter notebook. I figured this was easier
than cutting and pasting the code into an email. If you would like me to
send you the code in a different format, let me know. The necessary data is
all downloaded within the notebook itself.

https://console.ng.bluemix.net/data/notebooks/fe7e578a-401f-4744-a318-b1b6bcf6f5f8/view?access_token=2f6df7b1dfcb3c1c2d94a794506bb282729dab8f05118fafe5f11886326e02fc

A few additional pieces of information.

1. The training dataset is cached before training the model. If you do not
cache the training dataset, the model will not train; the code
model.transform(test) fails with a similar error. There are no other changes
besides caching or not caching. Again, with the training dataset cached, the
model can be successfully trained as seen in the notebook.

2. I have another version of the notebook where I download the same data in
libsvm format rather than csv. That notebook works fine. All the code is
essentially the same accounting for the difference in file formats.

3. I tested this same code on another Spark cloud platform and it displays
the same symptoms when run there.

Thanks.
Rich
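
The pattern Bryan suggests below (fitting the VectorIndexer on the full dataset before
splitting into training and test sets) looks roughly like this sketch; the column names,
the maxCategories value and the `data` DataFrame are placeholders from the Spark
documentation example, not from the notebook above:

import org.apache.spark.ml.feature.VectorIndexer

// Fit the indexer on the FULL dataset so every category value is seen...
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
// ...and only then split for training and testing.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))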


On Wed, Jun 29, 2016 at 12:59 AM, Bryan Cutler  wrote:

> Are you fitting the VectorIndexer to the entire data set and not just
> training or test data?  If you are able to post your code and some data to
> reproduce, that would help in troubleshooting.
>
> On Tue, Jun 28, 2016 at 4:40 PM, Rich Tarro  wrote:
>
>> Thanks for the response, but in my case I reversed the meaning of
>> "prediction" and "predictedLabel". It seemed to make more sense to me that
>> way, but in retrospect, it probably only causes confusion to anyone else
>> looking at this. I reran the code with all the pipeline stage inputs and
>> outputs named exactly as in the Random Forest Classifier example to make
>> sure I hadn't messed anything up when I renamed things. Same error.
>>
>> I'm still at the point where I can train the model and make predictions,
>> but not able to get the MulticlassClassificationEvaluator to work on the
>> DataFrame of predictions.
>>
>> Any other suggestions? Thanks.
>>
>>
>>
>> On Tue, Jun 28, 2016 at 4:21 PM, Rich Tarro  wrote:
>>
>>> I created a ML pipeline using the Random Forest Classifier - similar to
>>> what is described here except in my case the source data is in csv format
>>> rather than libsvm.
>>>
>>>
>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
>>>
>>> I am able to successfully train the model and make predictions (on test
>>> data not used to train the model) as shown here.
>>>
>>> +------------+--------------+-----+----------+--------------------+
>>> |indexedLabel|predictedLabel|label|prediction|            features|
>>> +------------+--------------+-----+----------+--------------------+
>>> |         4.0|           4.0|    0|         0|(784,[124,125,126...|
>>> |         2.0|           2.0|    3|         3|(784,[119,120,121...|
>>> |         8.0|           8.0|    8|         8|(784,[180,181,182...|
>>> |         0.0|           0.0|    1|         1|(784,[154,155,156...|
>>> |         3.0|           8.0|    2|         8|(784,[148,149,150...|
>>> +------------+--------------+-----+----------+--------------------+
>>> only showing top 5 rows
>>>
>>> However, when I attempt to calculate the error between the indexedLabel and 
>>> the precictedLabel using the MulticlassClassificationEvaluator, I get the 
>>> NoSuchElementException error attached below.
>>>
>>> val evaluator = new 
>>> MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("predictedLabel").setMetricName("precision")
>>> val accuracy = evaluator.evaluate(predictions)
>>> println("Test Error = " + (1.0 - accuracy))
>>>
>>> What could be the issue?
>>>
>>>
>>>
>>> Name: org.apache.spark.SparkException
>>> Message: Job aborted due to stage failure: Task 2 in stage 49.0 failed 10 
>>> times, most recent failure: Lost task 2.9 in stage 49.0 (TID 162, 
>>> yp-spark-dal09-env5-0024): java.util.NoSuchElementException: key not found: 
>>> 132.0
>>> at scala.collection.MapLike$class.default(MapLike.scala:228)
>>> at scala.collection.AbstractMap.default(Map.scala:58)
>>> at scala.collection.MapLike$class.apply(MapLike.scala:141)
>>> at scala.collection.AbstractMap.apply(Map.scala:58)
>>> at 
>>> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:331)
>>> at 
>>> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:309)
>>> at 
>>> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
>>> at 
>>> org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:351)
>>> at 
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>>>  

How are threads created in SQL Executor?

2016-07-01 Thread emiretsk
It seems like threads are created by SQLExecution.withExecutionId, which is
called inside BroadcastExchangeExec.scala.
When does the plan executor execute a BroadcastExchange, and is there a way
to control the number of threads? We have a job that writes DataFrames to an
external DB, and it seems like each task creates as many threads as there
are cores.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-are-threads-created-in-SQL-Executor-tp27260.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deploying ML Pipeline Model

2016-07-01 Thread Silvio Fiorito
Hi Rishabh,

My colleague, Richard Garris from Databricks, actually just gave a talk last
night at the Bay Area Spark Meetup on ML model deployment. The slides and
recording should be up soon; you should be able to find a link here:
http://www.meetup.com/spark-users/events/231574440/

Thanks,
Silvio

From: Rishabh Bhardwaj 
Date: Friday, July 1, 2016 at 7:54 AM
To: user 
Cc: "d...@spark.apache.org" 
Subject: Deploying ML Pipeline Model

Hi All,

I am looking for ways to deploy an ML Pipeline model in production.
Spark has already proved to be one of the best frameworks for model training
and creation, but once the ML pipeline model is ready, how can I deploy it
outside the Spark context?
MLlib models have a toPMML method, but today a Pipeline model cannot be saved to
PMML. There are some frameworks, like MLeap, which are trying to abstract the
Pipeline Model and provide ML Pipeline Model deployment outside the Spark
context, but currently they don't have most of the ML transformers and
estimators.
I am looking for related work going on in this area.
Any pointers will be helpful.

Thanks,
Rishabh.


Re: HiveContext

2016-07-01 Thread Mich Talebzadeh
hi,

In general, if your ORC table is not bucketed it is not going to do much.

The idea is that using predicate pushdown you will only get the data from
the partition concerned and avoid expensive table scans!

ORC provides what is known as a store index at file, stripe and row-group
levels (default 10K rows). That is just statistics for min, avg and max for
each column.

Now going back to practicality, you can do a simple test. Log in to Hive
and run your query with EXPLAIN EXTENDED select ... and see what you see.

Then try it from Spark. As far as I am aware Spark will not rely on
anything Hive-wise, except the metadata. It will use its DAG and in-memory
capability to do the query.

just try it and see.
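
for example, a sketch of the same check from the Spark side (the table name and filter
below are just an illustration):

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val df = HiveContext.sql("SELECT * FROM oraclehadoop.sales2 WHERE prod_id = 13")
df.explain(true)   // inspect where the filter ends up in the physical plan
df.count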

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 July 2016 at 11:00, manish jaiswal  wrote:

> Hi,
>
> Using sparkHiveContext when we read all rows where age was between 0 and
> 100, even though we requested rows where age was less than 15. Such full
> table scanning is an expensive operation.
>
> ORC avoids this type of overhead by using predicate push-down with three
> levels of built-in indexes within each file: file level, stripe level, and
> row level:
>
>-
>
>File and stripe level statistics are in the file footer, making it
>easy to determine if the rest of the file needs to be read.
>-
>
>Row level indexes include column statistics for each row group and
>position, for seeking to the start of the row group.
>
> ORC utilizes these indexes to move the filter operation to the data
> loading phase, by reading only data that potentially includes required rows.
>
>
> My doubt is when we give some query to hiveContext in orc table using
> spark with
>
> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>
> how it will perform
>
> 1.it will fetch only those record from orc file according to query.or
>
> 2.it will take orc file in spark and then perform spark job using predicate 
> push-down
>
> and give you the records.
>
> (I am aware of hiveContext gives spark only metadata and location of the data)
>
>
> Thanks
>
> Manish
>
>


Spark 2.0.0-preview ... problem with jackson core version

2016-07-01 Thread Paolo Patierno
Hi,

Developing a custom receiver, up to today I used Spark version "2.0.0-SNAPSHOT" and
Scala version 2.11.7.
With these versions all tests work fine.

I have just switched to "2.0.0-preview" as the Spark version, but now I have the
following error:

An exception or error caused a run to abort: class 
com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final 
method 
withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
 
java.lang.VerifyError: class 
com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final 
method 
withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
at java.lang.ClassLoader.defineClass1(Native Method)

I see that using 2.0.0-SNAPSHOT the jackson core version is 2.6.5 ... now 2.6.1 
with module scala 2.5.3.

Thanks,
Paolo.

Paolo Patierno
Senior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoT
Microsoft Azure Advisor
Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience

Re: Deploying ML Pipeline Model

2016-07-01 Thread Steve Goodman
Hi Rishabh,

I have a similar use-case and have struggled to find the best solution. As
I understand it 1.6 provides pipeline persistence in Scala, and that will
be expanded in 2.x. This project https://github.com/jpmml/jpmml-sparkml
claims to support about a dozen pipeline transformers, and 6 or 7 different
model types, although I have not yet used it myself.

Looking forward to hearing better suggestions?

Steve


On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj 
wrote:

> Hi All,
>
> I am looking for ways to deploy a ML Pipeline model in production .
> Spark has already proved to be a one of the best framework for model
> training and creation, but once the ml pipeline model is ready how can I
> deploy it outside spark context ?
> MLlib model has toPMML method but today Pipeline model can not be saved to
> PMML. There are some frameworks like MLeap which are trying to abstract
> Pipeline Model and provide ML Pipeline Model deployment outside spark
> context,but currently they don't have most of the ml transformers and
> estimators.
> I am looking for related work going on this area.
> Any pointers will be helpful.
>
> Thanks,
> Rishabh.
>


Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Dood

On 7/1/2016 6:42 AM, Akhil Das wrote:

case class Holder(str: String, js:JsValue)


Hello,

Thanks!

I tried that before posting the question to the list but I keep getting 
an error such as this even after the map() operation to convert 
(String,JsValue) -> Holder and then toDF().


I am simply invoking the following:

val rddDF:DataFrame = rdd.map(x => Holder(x._1,x._2)).toDF
rddDF.registerTempTable("rddf")

rddDF.schema.mkString(",")


And getting the following:

[2016-07-01 11:57:02,720] WARN  .jobserver.JobManagerActor [] 
[akka://JobServer/user/context-supervisor/test] - Exception from job 
d4c9d145-92bf-4c64-8904-91c917bd61d3:
java.lang.UnsupportedOperationException: Schema for type 
play.api.libs.json.JsValue is not supported
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:718)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:693)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:691)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at 
org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at 
org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
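
A workaround that keeps coming up for this kind of schema error (a sketch, not a definitive
fix): store the JSON as a plain String in the case class, since Spark cannot derive a schema
for play.api.libs.json.JsValue, and re-parse that column only if it needs to be queried as
JSON. The names below are the ones from this thread; `rdd` is the RDD[(String, JsValue)] in
question.

import sqlContext.implicits._

case class Holder(str: String, js: String)   // keep the raw JSON text instead of JsValue

val rddDF = rdd.map { case (k, v) => Holder(k, v.toString) }.toDF()
rddDF.registerTempTable("rddf")

// If the JSON itself must be queried, read just that column back as JSON:
val jsonDF = sqlContext.read.json(rddDF.select("js").rdd.map(_.getString(0)))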




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Deploying ML Pipeline Model

2016-07-01 Thread Rishabh Bhardwaj
Hi All,

I am looking for ways to deploy an ML Pipeline model in production.
Spark has already proved to be one of the best frameworks for model
training and creation, but once the ML pipeline model is ready, how can I
deploy it outside the Spark context?
MLlib models have a toPMML method, but today a Pipeline model cannot be saved to
PMML. There are some frameworks, like MLeap, which are trying to abstract the
Pipeline Model and provide ML Pipeline Model deployment outside the Spark
context, but currently they don't have most of the ML transformers and
estimators.
I am looking for related work going on in this area.
Any pointers will be helpful.

Thanks,
Rishabh.


Re: JavaStreamingContext.stop() hangs

2016-07-01 Thread chandan prakash
http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html

On Fri, Jul 1, 2016 at 1:42 PM, manoop  wrote:

> I have a Spark job and I just want to stop it on some condition. Once the
> condition is met, I am calling JavaStreamingContext.stop(), but it just
> hangs. Does not move on to the next line, which is just a debug line. I
> expect it to come out.
>
> I already tried different variants of stop, that is, passing true to stop
> the spark context, etc. but nothing is working out.
>
> Here's the sample code:
>
> LOGGER.debug("Stop? {}", stop);
> if (stop) {
> jssc.stop(false, true);
> LOGGER.debug("STOPPED!");
> }
>
> I am using Spark 1.5.2. Any help / pointers would be appreciated.
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JavaStreamingContext-stop-hangs-tp27257.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Chandan Prakash


Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
Can you try the Cassandra connector 1.5? It is also compatible with Spark
1.6 according to their documentation
https://github.com/datastax/spark-cassandra-connector#version-compatibility
You can also crosspost it over here
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user

On Fri, Jul 1, 2016 at 5:45 PM, Joaquin Alzola 
wrote:

> HI Akhil
>
>
>
> I am using:
>
> Cassandra: 3.0.5
>
> Spark: 1.6.1
>
> Scala 2.10
>
> Spark-cassandra connector: 1.6.0
>
>
>
> *From:* Akhil Das [mailto:ak...@hacked.work]
> *Sent:* 01 July 2016 11:38
> *To:* Joaquin Alzola 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Remote RPC client disassociated
>
>
>
> This looks like a version conflict, which version of spark are you using?
> The Cassandra connector you are using is for Scala 2.10x and Spark 1.6
> version.
>
>
>
> On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola 
> wrote:
>
> HI List,
>
>
>
> I am launching this spark-submit job:
>
>
>
> hadoop@testbedocg:/mnt/spark> bin/spark-submit --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 --jars
> /mnt/spark/lib/TargetHolding_pyspark-cassandra-0.3.5.jar spark_v2.py
>
>
>
> spark_v2.py is:
>
> from pyspark_cassandra import CassandraSparkContext, Row
>
> from pyspark import SparkContext, SparkConf
>
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("test").setMaster("spark://
> 192.168.23.31:7077").set("spark.cassandra.connection.host",
> "192.168.23.31")
>
> sc = CassandraSparkContext(conf=conf)
>
> table =
> sc.cassandraTable("lebara_diameter_codes","nl_lebara_diameter_codes")
>
> food_count = table.select("errorcode2001").groupBy("errorcode2001").count()
>
> food_count.collect()
>
>
>
>
>
> Error I get when running the above command:
>
>
>
> [Stage 0:>  (0 +
> 3) / 7]16/06/30 10:40:36 ERROR TaskSchedulerImpl: Lost executor 0 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 7) / 7]16/06/30 10:40:40 ERROR TaskSchedulerImpl: Lost executor 1 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 5) / 7]16/06/30 10:40:42 ERROR TaskSchedulerImpl: Lost executor 3 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 4) / 7]16/06/30 10:40:46 ERROR TaskSchedulerImpl: Lost executor 4 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> 16/06/30 10:40:46 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4
> times; aborting job
>
> Traceback (most recent call last):
>
>   File "/mnt/spark-1.6.1-bin-hadoop2.6/spark_v2.py", line 11, in 
>
> food_count =
> table.select("errorcode2001").groupBy("errorcode2001").count()
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in
> count
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in
> fold
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in
> collect
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 813, in __call__
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
> 308, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 0.0 (TID 14, as4): ExecutorLostFailure (executor 4 exited caused by one of
> the running tasks) Reason: Remote RPC client disassociated. Likely due to
> containers exceeding thresholds, or network issues. Check driver logs for
> WARN messages.
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> 

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-01 Thread Akhil Das
You can use this https://github.com/wurstmeister/kafka-docker to spin up a
Kafka cluster locally and then point your Spark Streaming job at it to consume
from it.
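
A sketch of what the test side could look like once the dockerised broker is up; it assumes
the broker is reachable on localhost:9092, a made-up topic name, and the
spark-streaming-kafka artifact for Spark 1.6.x on the classpath:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-integration-test")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics      = Set("test-topic")          // hypothetical topic name

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    stream.map(_._2).print()                     // just prove that messages arrive

    ssc.start()
    ssc.awaitTerminationOrTimeout(30000)         // run the assertion window, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}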

On Fri, Jul 1, 2016 at 1:19 AM, SRK  wrote:

> Hi,
>
> I need to do integration tests using Spark Streaming. My idea is to spin up
> kafka using docker locally and use it to feed the stream to my Streaming
> Job. Any suggestions on how to do this would be of great help.
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


RE: Remote RPC client disassociated

2016-07-01 Thread Joaquin Alzola
HI Akhil

I am using:
Cassandra: 3.0.5
Spark: 1.6.1
Scala 2.10
Spark-cassandra connector: 1.6.0

From: Akhil Das [mailto:ak...@hacked.work]
Sent: 01 July 2016 11:38
To: Joaquin Alzola 
Cc: user@spark.apache.org
Subject: Re: Remote RPC client disassociated

This looks like a version conflict, which version of spark are you using? The 
Cassandra connector you are using is for Scala 2.10x and Spark 1.6 version.

On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola 
> wrote:
HI List,

I am launching this spark-submit job:

hadoop@testbedocg:/mnt/spark> bin/spark-submit --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 --jars 
/mnt/spark/lib/TargetHolding_pyspark-cassandra-0.3.5.jar spark_v2.py

spark_v2.py is:
from pyspark_cassandra import CassandraSparkContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("test").setMaster("spark://192.168.23.31:7077").set("spark.cassandra.connection.host", "192.168.23.31")
sc = CassandraSparkContext(conf=conf)
table = sc.cassandraTable("lebara_diameter_codes","nl_lebara_diameter_codes")
food_count = table.select("errorcode2001").groupBy("errorcode2001").count()
food_count.collect()


Error I get when running the above command:

[Stage 0:>  (0 + 3) / 
7]16/06/30 10:40:36 ERROR TaskSchedulerImpl: Lost executor 0 on as5: Remote RPC 
client disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.
[Stage 0:>  (0 + 7) / 
7]16/06/30 10:40:40 ERROR TaskSchedulerImpl: Lost executor 1 on as4: Remote RPC 
client disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.
[Stage 0:>  (0 + 5) / 
7]16/06/30 10:40:42 ERROR TaskSchedulerImpl: Lost executor 3 on as5: Remote RPC 
client disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.
[Stage 0:>  (0 + 4) / 
7]16/06/30 10:40:46 ERROR TaskSchedulerImpl: Lost executor 4 on as4: Remote RPC 
client disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.
16/06/30 10:40:46 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4 times; 
aborting job
Traceback (most recent call last):
  File "/mnt/spark-1.6.1-bin-hadoop2.6/spark_v2.py", line 11, in 
food_count = table.select("errorcode2001").groupBy("errorcode2001").count()
  File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
  File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
  File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
  File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
  File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, 
in __call__
  File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in 
get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 
14, as4): ExecutorLostFailure (executor 4 exited caused by one of the running 
tasks) Reason: Remote RPC client disassociated. Likely due to containers 
exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 

Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Akhil Das
Something like this?

import sqlContext.implicits._
case class Holder(str: String, js:JsValue)

yourRDD.map(x => Holder(x._1, x._2)).toDF()



On Fri, Jul 1, 2016 at 3:36 AM, Dood@ODDO  wrote:

> Hello,
>
> I have an RDD[(String,JsValue)] that I want to convert into a DataFrame
> and then run SQL on. What is the easiest way to get the JSON (in form of
> JsValue) "understood" by the process?
>
> Thanks!
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
This looks like a version conflict; which version of Spark are you using?
The Cassandra connector you are using is for Scala 2.10.x and Spark 1.6.

On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola 
wrote:

> HI List,
>
>
>
> I am launching this spark-submit job:
>
>
>
> hadoop@testbedocg:/mnt/spark> bin/spark-submit --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 --jars
> /mnt/spark/lib/TargetHolding_pyspark-cassandra-0.3.5.jar spark_v2.py
>
>
>
> spark_v2.py is:
>
> from pyspark_cassandra import CassandraSparkContext, Row
>
> from pyspark import SparkContext, SparkConf
>
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("test").setMaster("spark://
> 192.168.23.31:7077").set("spark.cassandra.connection.host",
> "192.168.23.31")
>
> sc = CassandraSparkContext(conf=conf)
>
> table =
> sc.cassandraTable("lebara_diameter_codes","nl_lebara_diameter_codes")
>
> food_count = table.select("errorcode2001").groupBy("errorcode2001").count()
>
> food_count.collect()
>
>
>
>
>
> Error I get when running the above command:
>
>
>
> [Stage 0:>  (0 +
> 3) / 7]16/06/30 10:40:36 ERROR TaskSchedulerImpl: Lost executor 0 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 7) / 7]16/06/30 10:40:40 ERROR TaskSchedulerImpl: Lost executor 1 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 5) / 7]16/06/30 10:40:42 ERROR TaskSchedulerImpl: Lost executor 3 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 4) / 7]16/06/30 10:40:46 ERROR TaskSchedulerImpl: Lost executor 4 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> 16/06/30 10:40:46 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4
> times; aborting job
>
> Traceback (most recent call last):
>
>   File "/mnt/spark-1.6.1-bin-hadoop2.6/spark_v2.py", line 11, in 
>
> food_count =
> table.select("errorcode2001").groupBy("errorcode2001").count()
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in
> count
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in
> fold
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in
> collect
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 813, in __call__
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
> 308, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 0.0 (TID 14, as4): ExecutorLostFailure (executor 4 exited caused by one of
> the running tasks) Reason: Remote RPC client disassociated. Likely due to
> containers exceeding thresholds, or network issues. Check driver logs for
> WARN messages.
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
>
>at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>
> at scala.Option.foreach(Option.scala:236)
>
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>
> at
> 

HiveContext

2016-07-01 Thread manish jaiswal
Hi,

Using sparkHiveContext, we read all rows where age was between 0 and
100, even though we requested rows where age was less than 15. Such full
table scanning is an expensive operation.

ORC avoids this type of overhead by using predicate push-down with three
levels of built-in indexes within each file: file level, stripe level, and
row level:

   - File and stripe level statistics are in the file footer, making it easy
     to determine if the rest of the file needs to be read.

   - Row level indexes include column statistics for each row group and
     position, for seeking to the start of the row group.

ORC utilizes these indexes to move the filter operation to the data loading
phase, by reading only data that potentially includes required rows.


My doubt is: when we give some query to hiveContext on an ORC table using Spark
with

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

how will it perform?

1. Will it fetch only those records from the ORC file that match the query, or

2. will it take the ORC file into Spark and then perform the Spark job using
predicate push-down, and then give you the records?

(I am aware that hiveContext gives Spark only the metadata and the location of the data.)
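
For concreteness, the kind of setup I mean is roughly this sketch (the table and column
names are made up):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
val under15 = hiveContext.sql("SELECT * FROM people_orc WHERE age < 15")
under15.explain(true)   // does the plan show the filter being pushed to the ORC reader?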


Thanks

Manish


Re: How spark makes partition when we insert data using the Sql query, and how the permissions to the partitions is assigned.?

2016-07-01 Thread Mich Talebzadeh
Let us take this for a ride.

Simple code: it reads from an existing table of 22 million rows stored as ORC
and saves it as Parquet.

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("use oraclehadoop")
val s = HiveContext.table("sales2")
val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
sorted.count
sorted.save("oraclehadoop.sales3")

It will store it on hdfs in this case, under directory
/user/hduser/oraclehadoop.sales3/

Note the subdirectory corresponds to save()

It is saved by Spark and the number of partitions I gather is determined by
Spark code (I don't know the content)

The sub-directory owner on hdfs will be /user/ by default
and if you list the partitions it will show something like below:

-rw-r--r--   2 hduser supergroup  0 2016-07-01 09:26
/user/hduser/oraclehadoop.sales3/_SUCCESS
-rw-r--r--   2 hduser supergroup743 2016-07-01 09:26
/user/hduser/oraclehadoop.sales3/_common_metadata
-rw-r--r--   2 hduser supergroup 182639 2016-07-01 09:26
/user/hduser/oraclehadoop.sales3/_metadata
-rw-r--r--   2 hduser supergroup  22962 2016-07-01 09:23
/user/hduser/oraclehadoop.sales3/part-r-0-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup  25698 2016-07-01 09:23
/user/hduser/oraclehadoop.sales3/part-r-1-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup  17210 2016-07-01 09:23
/user/hduser/oraclehadoop.sales3/part-r-2-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup  22398 2016-07-01 09:23
/user/hduser/oraclehadoop.sales3/part-r-3-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup  18105 2016-07-01 09:23
/user/hduser/oraclehadoop.sales3/part-r-4-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet

Note that the metadata is also stored in the directory together with the data
partitions (zipped). This is in contrast to Hive, where metadata is stored in
the Hive metastore.

In this case it decides to have 200 data partitions.
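
(The 200 corresponds to the default spark.sql.shuffle.partitions used by the sort; a sketch
of two ways to end up with fewer output files, the value 8 being an arbitrary illustration:)

// before the shuffle: lower the shuffle partition count for this session
HiveContext.setConf("spark.sql.shuffle.partitions", "8")

// or after the fact: coalesce the sorted DataFrame before writing it out
sorted.coalesce(8).write.mode("overwrite").parquet("oraclehadoop.sales3")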

If you want more control of it, you can get the data as DF, register it as
tempTable, create the table the way you like it and do an insert/select
from tempTable.

I personally prefer to create the table myself in a format that I like
(like ORC below) and store it in Hive.

Example

var sqltext: String = ""
sqltext =
"""
 CREATE TABLE IF NOT EXISTS oraclehadoop.sales3
 (
  PROD_IDbigint   ,
  CUST_IDbigint   ,
  TIME_IDtimestamp,
  CHANNEL_ID bigint   ,
  PROMO_ID   bigint   ,
  QUANTITY_SOLD  decimal(10)  ,
  AMOUNT_SOLDdecimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1")
"""
HiveContext.sql(sqltext)
sorted.registerTempTable("tmp")
sqltext =
"""
INSERT INTO
oraclehadoop.sales3
SELECT * FROM tmp
"""
HiveContext.sql(sqltext)


HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 July 2016 at 08:43, shiv4nsh  wrote:

> Hey guys I am using Apache Spark 1.5.2, and I am running the Sql query
> using
> the SQLContext and when I run the insert query it saves the data in
> partition (as expected).
>
> I am just curious and want to know how these partitions are made and how
> the
> permissions to these partition is assigned . Can we change it? Does it
> behave differently on hdfs.?
>
> If someone can point me to the exact code in spark  that would be
> beneficial.
>
> I have also posted it on  stackOverflow.
> <
> http://stackoverflow.com/questions/38138113/how-spark-makes-partition-when-we-insert-data-using-the-sql-query-and-how-the-p
> >
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-makes-partition-when-we-insert-data-using-the-Sql-query-and-how-the-permissions-to-the-par-tp27256.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


JavaStreamingContext.stop() hangs

2016-07-01 Thread manoop
I have a Spark job and I just want to stop it on some condition. Once the
condition is met, I am calling JavaStreamingContext.stop(), but it just
hangs. Does not move on to the next line, which is just a debug line. I
expect it to come out.

I already tried different variants of stop, that is, passing true to stop
the spark context, etc. but nothing is working out.

Here's the sample code:

LOGGER.debug("Stop? {}", stop);
if (stop) {
jssc.stop(false, true);
LOGGER.debug("STOPPED!");
}

I am using Spark 1.5.2. Any help / pointers would be appreciated.

Thanks.
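
One pattern that sometimes avoids this hang is to hand the stop() call to a separate thread
instead of calling it from inside the batch/output code; a sketch in Scala, where the flag,
the polling interval and whether this applies at all depend on where the stop condition is
evaluated:

// `ssc` is the StreamingContext; `stopRequested` is a hypothetical flag your job sets.
@volatile var stopRequested = false

val stopper = new Thread(new Runnable {
  override def run(): Unit = {
    while (!stopRequested) Thread.sleep(1000)
    ssc.stop(false, true)   // same arguments as in the snippet above
  }
})
stopper.setDaemon(true)
stopper.start()

ssc.start()
ssc.awaitTermination()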



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JavaStreamingContext-stop-hangs-tp27257.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How spark makes partition when we insert data using the Sql query, and how the permissions to the partitions is assigned.?

2016-07-01 Thread shiv4nsh
Hey guys, I am using Apache Spark 1.5.2, and I am running SQL queries using
the SQLContext; when I run an insert query it saves the data in
partitions (as expected).

I am just curious and want to know how these partitions are made and how the
permissions to these partitions are assigned. Can we change it? Does it
behave differently on HDFS?

If someone can point me to the exact code in Spark, that would be
beneficial.

I have also posted it on stackOverflow.

  




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-makes-partition-when-we-insert-data-using-the-Sql-query-and-how-the-permissions-to-the-par-tp27256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org