RE: SparkML algos limitations question.

2016-01-04 Thread Ulanov, Alexander
Hi Yanbo,

As long as two models fit into the memory of a single machine, there should be no 
problems, so even 16GB machines can handle large models. (The master should have 
more memory because it runs L-BFGS.) In my experiments, I've trained models with 
12M and 32M parameters without issues.
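
For a rough sense of scale, a sketch (my own illustration, with made-up layer sizes) 
of how a model of that order is configured with the ML Pipelines API; the parameter 
count of a multilayer perceptron is roughly the sum of (inputs + 1) * outputs over 
consecutive layers:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Hypothetical topology: 2000 inputs, two hidden layers, 10 output classes.
// Weights + biases: (2000+1)*3000 + (3000+1)*1500 + (1500+1)*10 ~= 10.5M parameters.
val layers = Array[Int](2000, 3000, 1500, 10)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setMaxIter(100)
// trainer.fit(trainingData) then runs L-BFGS on the driver, which is why the
// master needs the extra memory; trainingData is a placeholder DataFrame here.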

Best regards, Alexander

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Sunday, December 27, 2015 2:23 AM
To: Joseph Bradley
Cc: Eugene Morozov; user; d...@spark.apache.org
Subject: Re: SparkML algos limitations question.

Hi Eugene,

AFAIK, the current implementation of MultilayerPerceptronClassifier has some 
scalability problems if the model is very large (such as >10M parameters), although 
I think models within that limit cover many use cases already.

Yanbo

2015-12-16 6:00 GMT+08:00 Joseph Bradley 
>:
Hi Eugene,

The maxDepth parameter exists because the implementation uses Integer node IDs 
which correspond to positions in the binary tree.  This simplified the 
implementation.  I'd like to eventually modify it to avoid depending on tree 
node IDs, but that is not yet on the roadmap.
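
As a back-of-the-envelope illustration of why that bounds the depth (my sketch, not 
the actual implementation): with binary-heap style indexing, the largest node ID for 
a tree of depth d is 2^(d+1) - 1, so 32-bit signed IDs run out at around depth 30.

// Largest node ID needed for a complete binary tree of the given depth
// (root at depth 0; children of node id are 2*id and 2*id + 1).
def maxNodeId(depth: Int): Long = (1L << (depth + 1)) - 1

(29 to 32).foreach { d =>
  println(s"depth $d -> max node id ${maxNodeId(d)} (fits in Int: ${maxNodeId(d) <= Int.MaxValue})")
}
// Depth 30 still fits (2^31 - 1 == Int.MaxValue); depth 31 does not, which lines up
// with the maxDepth limit of roughly 30 in the tree implementation.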

There is not an analogous limit for the GLMs you listed, but I'm not very 
familiar with the perceptron implementation.

Joseph

On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov 
> wrote:
Hello!

I'm currently working on a POC and trying to use Random Forest (classification and 
regression). I also have to check SVM and the multiclass perceptron (other algos 
are less important at the moment). So far I've discovered that Random Forest 
has a maxDepth limitation for trees, and just out of curiosity I wonder why 
such a limitation has been introduced.

The actual question is that I'm going to use Spark ML in production next year 
and would like to know whether there are other limitations like maxDepth in RF for 
other algorithms: Logistic Regression, Perceptron, SVM, etc.

Thanks in advance for your time.
--
Be well!
Jean Morozov




Re: Does state survive application restart in StatefulNetworkWordCount?

2016-01-04 Thread Tathagata Das
It does get recovered if you restart from checkpoints. See the example
RecoverableNetworkWordCount.scala
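
The key piece of that pattern, as a minimal sketch (placeholder checkpoint path and a 
stubbed-out DStream graph; the example in the Spark repo is more complete):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/wordcount-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // build the stateful DStream graph here (socketTextStream, updateStateByKey, ...)
  ssc
}

// On restart this rebuilds the context -- including the saved state -- from the
// checkpoint directory if it exists, otherwise it creates a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()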

On Sat, Jan 2, 2016 at 6:22 AM, Rado Buranský 
wrote:

> I am trying to understand how state in Spark Streaming works in general.
> If I run this example program twice will the second run see state from the
> first run?
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
>
> It seems not. Is there a way to achieve this? I am thinking about
> redeploying an application and I would like not to lose the current state.
>
> Thanks
>


Re: Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Michael Segel
It's been a while... but this isn't a Spark issue. 

A Spark job on YARN runs as a regular job. 
What happens when you run a regular M/R job as that user? 

I don’t think we did anything special...



> On Jan 4, 2016, at 12:22 PM, Mike Wright  > wrote:
> 
> Has anyone used Spark Job Server on a "kerberized" cluster in YARN-Client 
> mode? When Job Server contacts the YARN resource manager, we see a "Cannot 
> impersonate root" error and are not sure what we have misconfigured.
> 
> Thanks.
> 
> ___
> 
> Mike Wright
> Principal Architect, Software Engineering
> S Capital IQ and SNL
> 
> 434-951-7816 p
> 434-244-4466 f
> 540-470-0119 m
> 
> mwri...@snl.com 
> 
> 



RE: Is Spark 1.6 released?

2016-01-04 Thread Saif.A.Ellafi
Where can I read more about the Dataset API at the user level? I am failing to 
find an API doc or understand when to use DataFrame or Dataset, the advantages, etc.

Thanks,
Saif

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] 
Sent: Monday, January 04, 2016 2:01 PM
To: user@spark.apache.org
Subject: Re: Is Spark 1.6 released?

It's now OK: Michael published and announced the release.

Sorry for the delay.

Regards
JB

On 01/04/2016 10:06 AM, Jung wrote:
> Hi
> There were Spark 1.6 jars in maven central and github.
> I found it 5 days ago. But it doesn't appear on Spark website now.
> May I regard Spark 1.6 zip file in github as a stable release?
>
> Thanks
> Jung
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming Application is Stuck Under Heavy Load Due to DeadLock

2016-01-04 Thread Shixiong Zhu
Hey Rachana, could you provide the full jstack output? Maybe it's the same as
https://issues.apache.org/jira/browse/SPARK-11104

Best Regards,
Shixiong Zhu

2016-01-04 12:56 GMT-08:00 Rachana Srivastava <
rachana.srivast...@markmonitor.com>:

> Hello All,
>
>
>
> I am running my application on Spark cluster but under heavy load the
> system is hung due to deadlock.  I found similar issues resolved here
> https://datastax-oss.atlassian.net/browse/JAVA-555 in  Spark version
> 2.1.3.  But I am running on Spark 1.3 still getting the same issue.
>
>
>
> Here is the stack trace for reference:
>
>
>
> sun.misc.Unsafe.park(Native Method)
>
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>
>
> org.apache.spark.streaming.ContextWaiter.waitForStopOrError(ContextWaiter.scala:63)
>
>
> org.apache.spark.streaming.StreamingContext.awaitTermination(StreamingContext.scala:521)
>
>
> org.apache.spark.streaming.api.java.JavaStreamingContext.awaitTermination(JavaStreamingContext.scala:592)
>
>
>
> sun.misc.Unsafe.park(Native Method)
>
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>
> java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>
>
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:62)
>
>
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
>
>
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
>
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
>
>
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60)
>
>
>
> sun.misc.Unsafe.park(Native Method)
>
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
>
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
>
>
> org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
>
>
> org.spark-project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
>
>
> org.spark-project.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
>
>
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
>
> java.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> Thanks,
>
>
>
> Rachana
>
>
>


Re: HiveThriftServer fails to quote strings

2016-01-04 Thread Ted Yu
bq. without any of the escape characters:

Did you intend to show some sample ?

As far as I can tell, there was no sample or image in previous email.

FYI

On Mon, Jan 4, 2016 at 11:36 AM, sclyon  wrote:

> Hello all,
>
> I've got a nested JSON structure in parquet format that I'm having some
> issues with when trying to query it through Hive.
>
> In Spark (1.5.2) the column is represented correctly:
>
>
> However, when queried from Hive I get the same column but without any of
> the
> escape characters:
>
>
> Naturally this breaks my JSON parsers and I'm unable to use the data. Has
> anyone encountered this error before? I tried looking through the source
> but
> all I can find that I think is related is the HiveContext.toHiveString
> method.
>
> Any advice would be appreciated!
>
> Thanks,
> Scott
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-fails-to-quote-strings-tp25877.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Comparing Subsets of an RDD

2016-01-04 Thread Daniel Imberman
Hi,

I’m looking for a way to compare subsets of an RDD intelligently.

Let's say I had an RDD with key/value pairs of type (Int -> T). I eventually
need to say "compare all values of key 1 with all values of key 2, and
compare the values of key 3 to the values of key 5 and key 7"; how would I go
about doing this efficiently?

The way I’m currently thinking of doing it is by creating a List of
filtered RDDs and then using RDD.cartesian()


import org.apache.spark.rdd.RDD

def filterSubset[T](b: Int, r: RDD[(Int, T)]) =
  r.filter { case (name, _) => name == b }

val keyPairs: List[(Int, Int)] = ???   // all key pairs to compare

val rddPairs = keyPairs.map { case (a, b) =>
  filterSubset(a, r).cartesian(filterSubset(b, r))
}

rddPairs.map { pairRDD =>
  // whatever I want to compare…
  pairRDD
}



I would then iterate the list and perform a map on each of the RDDs of
pairs to gather the relational data that I need.



What I can't tell about this idea is whether it would be extremely
inefficient to set up possibly hundreds of map jobs and then iterate
through them. In this case, would the lazy evaluation in Spark optimize the
data shuffling between all of the maps? If not, can someone please
recommend a possibly more efficient way to approach this issue?
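
One alternative shape that might cut down on the number of jobs (a sketch of the 
idea only, assuming the list of key pairs is small enough to parallelize): group 
each key's values once, then join the key-pair list against that grouped RDD twice.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def comparePairs[T: ClassTag](r: RDD[(Int, T)], keyPairs: Seq[(Int, Int)]) = {
  val grouped = r.groupByKey().cache()                    // key -> all values for that key
  val pairsRDD = r.sparkContext.parallelize(keyPairs)     // (leftKey, rightKey)

  pairsRDD
    .join(grouped)                                        // (leftKey, (rightKey, leftValues))
    .map { case (left, (right, leftValues)) => (right, (left, leftValues)) }
    .join(grouped)                                        // (rightKey, ((leftKey, leftValues), rightValues))
    .map { case (right, ((left, leftValues), rightValues)) =>
      ((left, right), (leftValues, rightValues))          // compare leftValues vs rightValues here
    }
}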


Thank you for your help and apologies if this email sends more than once
(I'm having some issues with the mailing list)


Re: Is Spark 1.6 released?

2016-01-04 Thread Ted Yu
Please refer to the following:

https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets
https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets
https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
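
For a quick feel of the API, a minimal example along the lines of those docs 
(assumes a SQLContext named sqlContext, as in the spark-shell):

case class Person(name: String, age: Long)

import sqlContext.implicits._

val ds = Seq(Person("Andy", 32), Person("Berta", 40)).toDS()   // a Dataset[Person]
ds.filter(_.age > 35).map(_.name).collect()                    // typed, lambda-based operations
val df = ds.toDF()                                             // and back again with df.as[Person]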

Cheers

On Mon, Jan 4, 2016 at 11:59 AM,  wrote:

> Where can I read more about the dataset api on a user layer? I am failing
> to get an API doc or understand when to use DataFrame or DataSet,
> advantages, etc.
>
> Thanks,
> Saif
>
> -Original Message-
> From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> Sent: Monday, January 04, 2016 2:01 PM
> To: user@spark.apache.org
> Subject: Re: Is Spark 1.6 released?
>
> It's now OK: Michael published and announced the release.
>
> Sorry for the delay.
>
> Regards
> JB
>
> On 01/04/2016 10:06 AM, Jung wrote:
> > Hi
> > There were Spark 1.6 jars in maven central and github.
> > I found it 5 days ago. But it doesn't appear on Spark website now.
> > May I regard Spark 1.6 zip file in github as a stable release?
> >
> > Thanks
> > Jung
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Is Spark 1.6 released?

2016-01-04 Thread Michael Armbrust
I also wrote about it here:
https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html

And put together a bunch of examples here:
https://docs.cloud.databricks.com/docs/spark/1.6/index.html

On Mon, Jan 4, 2016 at 12:02 PM, Annabel Melongo <
melongo_anna...@yahoo.com.invalid> wrote:

> [1] http://spark.apache.org/releases/spark-release-1-6-0.html
> [2] http://spark.apache.org/downloads.html
>
>
>
> On Monday, January 4, 2016 2:59 PM, "saif.a.ell...@wellsfargo.com" <
> saif.a.ell...@wellsfargo.com> wrote:
>
>
> Where can I read more about the dataset api on a user layer? I am failing
> to get an API doc or understand when to use DataFrame or DataSet,
> advantages, etc.
>
> Thanks,
> Saif
>
> -Original Message-
> From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
> Sent: Monday, January 04, 2016 2:01 PM
> To: user@spark.apache.org
> Subject: Re: Is Spark 1.6 released?
>
> It's now OK: Michael published and announced the release.
>
> Sorry for the delay.
>
> Regards
> JB
>
> On 01/04/2016 10:06 AM, Jung wrote:
> > Hi
> > There were Spark 1.6 jars in maven central and github.
> > I found it 5 days ago. But it doesn't appear on Spark website now.
> > May I regard Spark 1.6 zip file in github as a stable release?
> >
> > Thanks
> > Jung
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


HiveThriftServer fails to quote strings

2016-01-04 Thread sclyon
Hello all,

I've got a nested JSON structure in parquet format that I'm having some
issues with when trying to query it through Hive.

In Spark (1.5.2) the column is represented correctly:


However, when queried from Hive I get the same column but without any of the
escape characters:


Naturally this breaks my JSON parsers and I'm unable to use the data. Has
anyone encountered this error before? I tried looking through the source but
all I can find that I think is related is the HiveContext.toHiveString
method.

Any advice would be appreciated!

Thanks,
Scott



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-fails-to-quote-strings-tp25877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Spark 1.6 released?

2016-01-04 Thread Annabel Melongo
[1] http://spark.apache.org/releases/spark-release-1-6-0.html
[2] http://spark.apache.org/downloads.html

On Monday, January 4, 2016 2:59 PM, "saif.a.ell...@wellsfargo.com" 
 wrote:

 Where can I read more about the dataset api on a user layer? I am failing to 
get an API doc or understand when to use DataFrame or DataSet, advantages, etc.

Thanks,
Saif

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] 
Sent: Monday, January 04, 2016 2:01 PM
To: user@spark.apache.org
Subject: Re: Is Spark 1.6 released?

It's now OK: Michael published and announced the release.

Sorry for the delay.

Regards
JB

On 01/04/2016 10:06 AM, Jung wrote:
> Hi
> There were Spark 1.6 jars in maven central and github.
> I found it 5 days ago. But it doesn't appear on Spark website now.
> May I regard Spark 1.6 zip file in github as a stable release?
>
> Thanks
> Jung
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Does state survive application restart in StatefulNetworkWordCount?

2016-01-04 Thread Rado Buranský
I asked the question on Twitter and got this response:
https://twitter.com/jaceklaskowski/status/683923649632579588

Is Jacek right? If you stop and then start the application correctly then
the state is not recovered? It is recovered only in case of failure?

On Mon, Jan 4, 2016 at 8:19 PM, Tathagata Das  wrote:

> It does get recovered if you restart from checkpoints. See the example
> RecoverableNetworkWordCount.scala
>
> On Sat, Jan 2, 2016 at 6:22 AM, Rado Buranský 
> wrote:
>
>> I am trying to understand how state in Spark Streaming works in general.
>> If I run this example program twice will the second run see state from the
>> first run?
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
>>
>> It seems not. Is there a way to achieve this? I am thinking about
>> redeploying an application and I would like not to lose the current state.
>>
>> Thanks
>>
>
>


Spark Streaming Application is Stuck Under Heavy Load Due to DeadLock

2016-01-04 Thread Rachana Srivastava
Hello All,

I am running my application on a Spark cluster, but under heavy load the system is 
hung due to a deadlock.  I found similar issues resolved here 
https://datastax-oss.atlassian.net/browse/JAVA-555 in Spark version 2.1.3.  
But I am running on Spark 1.3 and still getting the same issue.

Here is the stack trace for reference:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
org.apache.spark.streaming.ContextWaiter.waitForStopOrError(ContextWaiter.scala:63)
org.apache.spark.streaming.StreamingContext.awaitTermination(StreamingContext.scala:521)
org.apache.spark.streaming.api.java.JavaStreamingContext.awaitTermination(JavaStreamingContext.scala:592)

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:62)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60)

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
org.spark-project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
org.spark-project.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
java.lang.Thread.run(Thread.java:745)


Thanks,

Rachana





Re: HiveThriftServer fails to quote strings

2016-01-04 Thread Scott Lyons
Apparently nabble ate my code samples.

In Spark (1.5.2) the column is represented correctly:
sqlContext.sql("SELECT * FROM tempdata").collect()
[{"PageHtml":"{\\"time\\":0}"}]

However, when queried from Hive I get the same column but without any of the 
escape characters:
Beeline (or PyHive) > SELECT * FROM tempdata LIMIT 1
[{"PageHtml":"{"time":0}"}]

Thanks for the heads up

From: Ted Yu >
Date: Monday, January 4, 2016 at 11:54 AM
To: Scott Lyons >
Cc: user >
Subject: Re: HiveThriftServer fails to quote strings

bq. without any of the escape characters:

Did you intend to show some sample ?

As far as I can tell, there was no sample or image in previous email.

FYI

On Mon, Jan 4, 2016 at 11:36 AM, sclyon 
> wrote:
Hello all,

I've got a nested JSON structure in parquet format that I'm having some
issues with when trying to query it through Hive.

In Spark (1.5.2) the column is represented correctly:


However, when queried from Hive I get the same column but without any of the
escape characters:


Naturally this breaks my JSON parsers and I'm unable to use the data. Has
anyone encountered this error before? I tried looking through the source but
all I can find that I think is related is the HiveContext.toHiveString
method.

Any advice would be appreciated!

Thanks,
Scott



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HiveThriftServer-fails-to-quote-strings-tp25877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org




Re: pyspark streaming crashes

2016-01-04 Thread Antony Mayi
Just for reference: in my case this problem is caused by this bug: 
https://issues.apache.org/jira/browse/SPARK-12617 

On Monday, 21 December 2015, 14:32, Antony Mayi  
wrote:
 
 

I noticed it might be related to longer GC pauses (1-2 sec) - the crash 
usually occurs after such a pause. Could that be causing the python-java gateway 
to time out? 

On Sunday, 20 December 2015, 23:05, Antony Mayi  
wrote:
 
 

Hi,
can anyone please help me troubleshoot this problem: I have a streaming pyspark 
application (Spark 1.5.2 on yarn-client) which keeps crashing after a few hours. 
It doesn't seem to be running out of memory on either the driver or the executors.

driver error:
py4j.protocol.Py4JJavaError: An error occurred while calling o1.awaitTermination.
: java.io.IOException: py4j.Py4JException: Error while obtaining a new communication channel
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
        at org.apache.spark.streaming.api.python.TransformFunction.writeObject(PythonDStream.scala:77)
        at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)

(all) executors error:
  File "/u04/yarn/local/usercache/das/appcache/application_1450337892069_0336/container_1450337892069_0336_01_08/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/u04/yarn/local/usercache/das/appcache/application_1450337892069_0336/container_1450337892069_0336_01_08/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError


GC (using G1GC) debugging just before crash:
driver:      [Eden: 2316.0M(2316.0M)->0.0B(2318.0M) Survivors: 140.0M->138.0M Heap: 3288.7M(4096.0M)->675.5M(4096.0M)]
executor(s): [Eden: 2342.0M(2342.0M)->0.0B(2378.0M) Survivors: 52.0M->34.0M Heap: 3601.7M(4096.0M)->1242.7M(4096.0M)]

thanks.

Re: Unable to run spark SQL Join query.

2016-01-04 Thread ๏̯͡๏
There are three tables in action here.

Table A (success_events.sojsuccessevents1) JOIN TABLE B (dw_bid) to create
TABLE C (sojsuccessevents2_spark)

Now, table success_events.sojsuccessevents1 has an itemid column, which I confirmed
by running describe success_events.sojsuccessevents1 from the spark-sql shell.

I changed my join query to use itemid.

" on a.itemid = b.item_id  and  a.transactionid =  b.transaction_id " +

But I still get the same error


16/01/04 03:29:27 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'a.itemid' given
input columns bid_flags, slng_chnl_id, upd_user, bdr_id, bid_status_id,
bid_dt, transaction_id, host_ip_addr, upd_date, item_vrtn_id, auct_end_dt,
bid_amt_unit_lstg_curncy, bidding_site_id, cre_user, bid_cobrand_id,
bdr_site_id, app_id, lstg_curncy_id, bid_exchng_rate, bid_date,
ebx_bid_yn_id, cre_date, winning_qty, bid_type_code, half_on_ebay_bid_id,
bdr_cntry_id, qty_bid, item_id; line 1 pos 864)

It appears as if it's trying to look for itemid in TABLE B (dw_bid) instead
of TABLE A (success_events.sojsuccessevents1), as the columns above are from
TABLE B.
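
One way to narrow this down (a sketch, using the table names above): print the
schemas Spark actually resolves for each side, and express the join through the
DataFrame API with explicit aliases:

import hiveContext.implicits._

val a = hiveContext.table("success_events.sojsuccessevents1").as("a")
val b = hiveContext.table("dw_bid").as("b")
a.printSchema()   // confirm the exact column names Spark sees on each side
b.printSchema()

val joined = a.join(b, $"a.itemid" === $"b.item_id" && $"a.transactionid" === $"b.transaction_id")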

Regards,
Deepak




On Sun, Jan 3, 2016 at 7:42 PM, Jins George  wrote:

> Column 'itemId' is not present in table 'success_events.sojsuccessevents1'
> or  'dw_bid'
>
> did you mean  'sojsuccessevents2_spark' table  in your select query ?
>
> Thanks,
> Jins
>
> On 01/03/2016 07:22 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
>
> Code:
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
> hiveContext.sql("drop table sojsuccessevents2_spark")
>
> hiveContext.sql("CREATE TABLE `sojsuccessevents2_spark`( `guid`
> string COMMENT 'from deserializer', `sessionkey` bigint COMMENT 'from
> deserializer', `sessionstartdate` string COMMENT 'from deserializer',
> `sojdatadate` string COMMENT 'from deserializer', `seqnum` int COMMENT
> 'from deserializer', `eventtimestamp` string COMMENT 'from deserializer',
> `siteid` int COMMENT 'from deserializer', `successeventtype` string COMMENT
> 'from deserializer', `sourcetype` string COMMENT 'from deserializer',
> `itemid` bigint COMMENT 'from deserializer', `shopcartid` bigint COMMENT
> 'from deserializer', `transactionid` bigint COMMENT 'from deserializer',
> `offerid` bigint COMMENT 'from deserializer', `userid` bigint COMMENT 'from
> deserializer', `priorpage1seqnum` int COMMENT 'from deserializer',
> `priorpage1pageid` int COMMENT 'from deserializer',
> `exclwmsearchattemptseqnum` int COMMENT 'from deserializer',
> `exclpriorsearchpageid` int COMMENT 'from deserializer',
> `exclpriorsearchseqnum` int COMMENT 'from deserializer',
> `exclpriorsearchcategory` int COMMENT 'from deserializer',
> `exclpriorsearchl1` int COMMENT 'from deserializer', `exclpriorsearchl2`
> int COMMENT 'from deserializer', `currentimpressionid` bigint COMMENT 'from
> deserializer', `sourceimpressionid` bigint COMMENT 'from deserializer',
> `exclpriorsearchsqr` string COMMENT 'from deserializer',
> `exclpriorsearchsort` string COMMENT 'from deserializer', `isduplicate` int
> COMMENT 'from deserializer', `transactiondate` string COMMENT 'from
> deserializer', `auctiontypecode` int COMMENT 'from deserializer', `isbin`
> int COMMENT 'from deserializer', `leafcategoryid` int COMMENT 'from
> deserializer', `itemsiteid` int COMMENT 'from deserializer', `bidquantity`
> int COMMENT 'from deserializer', `bidamtusd` double COMMENT 'from
> deserializer', `offerquantity` int COMMENT 'from deserializer',
> `offeramountusd` double COMMENT 'from deserializer', `offercreatedate`
> string COMMENT 'from deserializer', `buyersegment` string COMMENT 'from
> deserializer', `buyercountryid` int COMMENT 'from deserializer', `sellerid`
> bigint COMMENT 'from deserializer', `sellercountryid` int COMMENT 'from
> deserializer', `sellerstdlevel` string COMMENT 'from deserializer',
> `csssellerlevel` string COMMENT 'from deserializer', `experimentchannel`
> int COMMENT 'from deserializer') ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION
> 'hdfs://
> apollo-phx-nn.vip.ebay.com:8020/user/dvasthimal/spark/successeventstaging/sojsuccessevents2'
> TBLPROPERTIES ( 'avro.schema.literal'='{\"type\":\"record\",\"name\":\"
> success\",\"namespace\":\"Reporting.detail\",\"doc\":\"\",\"fields\":[{\"
> name\":\"guid\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"
> String\"},\"doc\":\"\",\"default\":\"\"},{\"name\":\"sessionKey\",\"type\"
> :\"long\",\"doc\":\"\",\"default\":0},{\"name\":\"sessionStartDate\",\"
> type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\"doc\":\"\",
> \"default\":\"\"},{\"name\":\"sojDataDate\",\"type\":{\"type\":\"string\",
> \"avro.java.string\":\"String\"},\"doc\":\"\",\"default\":\"\"},{\"name\":

stopping a process usgin an RDD

2016-01-04 Thread domibd
Hello,

Is there a way to stop, under a condition, a process (like map-reduce) running 
on an RDD?

(This could be used if the process does not always need to
 explore the whole RDD.)

thanks

Dominique





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stopping-a-process-usgin-an-RDD-tp25870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is Spark 1.6 released?

2016-01-04 Thread Jung
Hi
There were Spark 1.6 jars in Maven Central and on GitHub.
I found them 5 days ago, but they don't appear on the Spark website now.
May I regard the Spark 1.6 zip file on GitHub as a stable release?

Thanks
Jung

Re: Is Spark 1.6 released?

2016-01-04 Thread Jean-Baptiste Onofré

Hi Jung,

yes Spark 1.6.0 has been released December, 28th.

The artifacts are on Maven Central:

http://repo1.maven.org/maven2/org/apache/spark/

However, the distribution is not available on dist.apache.org:

https://dist.apache.org/repos/dist/release/spark/

Let me check with the team to upload the distribution to dist.apache.org.

Regards
JB

On 01/04/2016 10:06 AM, Jung wrote:

Hi
There were Spark 1.6 jars in maven central and github.
I found it 5 days ago. But it doesn't appear on Spark website now.
May I regard Spark 1.6 zip file in github as a stable release?

Thanks
Jung



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Spark 1.6 released?

2016-01-04 Thread Michael Armbrust
>
> bq. In many cases, the current implementation of the Dataset API does not
> yet leverage the additional information it has and can be slower than RDDs.
>
>
Are the characteristics of cases above known so that users can decide which
> API to use ?
>

Lots of back-to-back operations aren't great yet because we serialize and
deserialize unnecessarily.  For example:
https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala#L37


>
> For custom encoders, I did a quick search but didn't find the JIRA number.
> Can you share the JIRA number ?
>

This is probably the closest thing:
https://issues.apache.org/jira/browse/SPARK-7768


Re: Monitor Job on Yarn

2016-01-04 Thread Ted Yu
Please look at history server related content under:
https://spark.apache.org/docs/latest/running-on-yarn.html

Note spark.yarn.historyServer.address
FYI

On Mon, Jan 4, 2016 at 2:49 PM, Daniel Valdivia 
wrote:

> Hello everyone, happy new year,
>
> I submitted an app to yarn, however I'm unable to monitor it's progress on
> the driver node, not in :8080 or :4040 as
> documented, when submitting to the standalone mode I could monitor however
> seems liek its not the case right now.
>
> I submitted my app this way:
>
> spark-submit --class my.class --master yarn --deploy-mode cluster myjar.jar
>
> and so far the job is on it's way it seems, the console is vivid with
> Application report messages, however I can't access the status of the app,
> should I have submitted the app in a different fashion to access the status
> of it?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Monitor Job on Yarn

2016-01-04 Thread Marcelo Vanzin
You should be looking at the YARN RM web ui to monitor YARN
applications; that will have a link to the Spark application's UI,
along with other YARN-related information.

Also, if you run the app in client mode, it might be easier to debug
it until you know it's running properly (since you'll see driver logs
in your terminal).

On Mon, Jan 4, 2016 at 2:49 PM, Daniel Valdivia  wrote:
> Hello everyone, happy new year,
>
> I submitted an app to yarn, however I'm unable to monitor it's progress on 
> the driver node, not in :8080 or :4040 as 
> documented, when submitting to the standalone mode I could monitor however 
> seems liek its not the case right now.
>
> I submitted my app this way:
>
> spark-submit --class my.class --master yarn --deploy-mode cluster myjar.jar
>
> and so far the job is on it's way it seems, the console is vivid with 
> Application report messages, however I can't access the status of the app, 
> should I have submitted the app in a different fashion to access the status 
> of it?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: groupByKey does not work?

2016-01-04 Thread Ted Yu
Can you give a bit more information ?

Release of Spark you're using
Minimal dataset that shows the problem

Cheers

On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra  wrote:

> I tried groupByKey and noticed that it did not group all values into the
> same group.
>
> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
> distinct keys, so I expected there to be 4 records in the groupByKey
> object, but instead there were 8. Each of the 4 distinct keys appear 2
> times.
>
> Is this the expected behavior? I need to be able to get ALL values
> associated with each key grouped into a SINGLE record. Is it possible?
>
> Arun
>
> p.s. reducebykey will not be sufficient for me
>


Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
Could you please post the associated code and output?

On Mon, Jan 4, 2016 at 3:55 PM Arun Luthra  wrote:

> I tried groupByKey and noticed that it did not group all values into the
> same group.
>
> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
> distinct keys, so I expected there to be 4 records in the groupByKey
> object, but instead there were 8. Each of the 4 distinct keys appear 2
> times.
>
> Is this the expected behavior? I need to be able to get ALL values
> associated with each key grouped into a SINGLE record. Is it possible?
>
> Arun
>
> p.s. reducebykey will not be sufficient for me
>


Monitor Job on Yarn

2016-01-04 Thread Daniel Valdivia
Hello everyone, happy new year,

I submitted an app to YARN, however I'm unable to monitor its progress on the 
driver node, neither at :8080 nor :4040 as documented. When submitting in 
standalone mode I could monitor it, however it seems like that's not the case 
right now.

I submitted my app this way:

spark-submit --class my.class --master yarn --deploy-mode cluster myjar.jar

and so far the job seems to be on its way; the console is busy with 
Application report messages. However, I can't access the status of the app. 
Should I have submitted the app in a different fashion to access its status?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2016-01-04 Thread Tathagata Das
You could enforce the evaluation of the transformed DStream by putting a
dummy output operation on it, and then do the windowing.

transformedDStream.foreachRDD { _.count() }  // to enforce evaluation of the transformation
transformedDStream.window(...).foreachRDD { rdd => ... }
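
Spelled out a bit more as a self-contained sketch (made-up source, path and
intervals of 1s / 60s):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ShortTransformLongWrite")
val ssc  = new StreamingContext(conf, Seconds(1))                  // short batch interval

val lines = ssc.socketTextStream("localhost", 9999)                // placeholder source
val transformed = lines.transform(rdd => rdd.map(_.toUpperCase))   // stand-in for the time-sensitive work

transformed.foreachRDD(_.count())                                  // dummy output op: forces the transform every batch

transformed
  .window(Seconds(60), Seconds(60))                                // gather 60 one-second batches
  .saveAsTextFiles("hdfs:///tmp/stream-output/part")               // placeholder path; one write per window

ssc.start()
ssc.awaitTermination()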

On Thu, Dec 31, 2015 at 5:54 AM, Ewan Leith 
wrote:

> Yeah it’s awkward, the transforms being done are fairly time sensitive, so
> I don’t want them to wait 60 seconds or more.
>
>
>
> I might have to move the code from a transform into a custom receiver
> instead, so they’ll be processed outside the window length. A buffered
> writer is a good idea too, thanks.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> *From:* Ashic Mahtab [mailto:as...@live.com]
> *Sent:* 31 December 2015 13:50
> *To:* Ewan Leith ; Apache Spark <
> user@spark.apache.org>
> *Subject:* RE: Batch together RDDs for Streaming output, without delaying
> execution of map or transform functions
>
>
>
> Hi Ewan,
>
> Transforms are definitions of what needs to be done - they don't execute
> until and action is triggered. For what you want, I think you might need to
> have an action that writes out rdds to some sort of buffered writer.
>
>
>
> -Ashic.
> --
>
> From: ewan.le...@realitymine.com
> To: user@spark.apache.org
> Subject: Batch together RDDs for Streaming output, without delaying
> execution of map or transform functions
> Date: Thu, 31 Dec 2015 11:35:37 +
>
> Hi all,
>
>
>
> I’m sure this must have been solved already, but I can’t see anything
> obvious.
>
>
>
> Using Spark Streaming, I’m trying to execute a transform function on a
> DStream at short batch intervals (e.g. 1 second), but only write the
> resulting data to disk using saveAsTextFiles in a larger batch after a
> longer delay (say 60 seconds).
>
>
>
> I thought the ReceiverInputDStream window function might be a good help
> here, but instead, applying it to a transformed DStream causes the
> transform function to only execute at the end of the window too.
>
>
>
> Has anyone got a solution to this?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>
>
>


Re: Monitor Job on Yarn

2016-01-04 Thread Daniel Valdivia
I see, I guess I should have set the historyServer.


Strangely enough, peeking into YARN it seems like nothing is "happening": it lists 
a single application running with 0% progress, but each node has 0 running 
containers, which makes me wonder whether anything is actually happening.

Should I restart the job with the spark.yarn.historyServer.address ?

[hadoop@sslabnode02 ~]$ yarn node -list
16/01/04 14:43:40 INFO client.RMProxy: Connecting to ResourceManager at 
sslabnode01/10.15.235.239:8032
16/01/04 14:43:40 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Total Nodes:3
          Node-Id   Node-State   Node-Http-Address   Number-of-Running-Containers
sslabnode02:54142      RUNNING    sslabnode02:8042                              0
sslabnode01:60780      RUNNING    sslabnode01:8042                              0
sslabnode03:60569      RUNNING    sslabnode03:8042                              0
[hadoop@sslabnode02 ~]$ yarn application -list
16/01/04 14:43:55 INFO client.RMProxy: Connecting to ResourceManager at 
sslabnode01/10.15.235.239:8032
16/01/04 14:43:55 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, 
ACCEPTED, RUNNING]):1
                Application-Id    Application-Name   Application-Type     User     Queue      State   Final-State   Progress   Tracking-URL
application_1451947397662_0001    ClusterIncidents              SPARK   hadoop   default   ACCEPTED     UNDEFINED         0%            N/A

> On Jan 4, 2016, at 2:55 PM, Ted Yu  wrote:
> 
> Please look at history server related content under:
> https://spark.apache.org/docs/latest/running-on-yarn.html 
> 
> 
> Note spark.yarn.historyServer.address
> FYI
> 
> On Mon, Jan 4, 2016 at 2:49 PM, Daniel Valdivia  > wrote:
> Hello everyone, happy new year,
> 
> I submitted an app to yarn, however I'm unable to monitor it's progress on 
> the driver node, not in :8080 or :4040 as 
> documented, when submitting to the standalone mode I could monitor however 
> seems liek its not the case right now.
> 
> I submitted my app this way:
> 
> spark-submit --class my.class --master yarn --deploy-mode cluster myjar.jar
> 
> and so far the job is on it's way it seems, the console is vivid with 
> Application report messages, however I can't access the status of the app, 
> should I have submitted the app in a different fashion to access the status 
> of it?
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



groupByKey does not work?

2016-01-04 Thread Arun Luthra
I tried groupByKey and noticed that it did not group all values into the
same group.

In my test dataset (a Pair rdd) I have 16 records, where there are only 4
distinct keys, so I expected there to be 4 records in the groupByKey
object, but instead there were 8. Each of the 4 distinct keys appear 2
times.

Is this the expected behavior? I need to be able to get ALL values
associated with each key grouped into a SINGLE record. Is it possible?

Arun

p.s. reducebykey will not be sufficient for me


Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
Spark 1.5.0

data:

p1,lo1,8,0,4,0,5,20150901|5,1,1.0
p1,lo2,8,0,4,0,5,20150901|5,1,1.0
p1,lo3,8,0,4,0,5,20150901|5,1,1.0
p1,lo4,8,0,4,0,5,20150901|5,1,1.0
p1,lo1,8,0,4,0,5,20150901|5,1,1.0
p1,lo2,8,0,4,0,5,20150901|5,1,1.0
p1,lo3,8,0,4,0,5,20150901|5,1,1.0
p1,lo4,8,0,4,0,5,20150901|5,1,1.0
p1,lo1,8,0,4,0,5,20150901|5,1,1.0
p1,lo2,8,0,4,0,5,20150901|5,1,1.0
p1,lo3,8,0,4,0,5,20150901|5,1,1.0
p1,lo4,8,0,4,0,5,20150901|5,1,1.0
p1,lo1,8,0,4,0,5,20150901|5,1,1.0
p1,lo2,8,0,4,0,5,20150901|5,1,1.0
p1,lo3,8,0,4,0,5,20150901|5,1,1.0
p1,lo4,8,0,4,0,5,20150901|5,1,1.0

spark-shell:

spark-shell \
--num-executors 2 \
--driver-memory 1g \
--executor-memory 10g \
--executor-cores 8 \
--master yarn-client


case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
f4:Char, f5:Char, f6:String)
case class Myvalue(count1:Long, count2:Long, num:Double)

val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
val spl = line.split("\\|", -1)
val k = spl(0).split(",")
val v = spl(1).split(",")
(Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
k(5)(0).toChar, k(6)(0).toChar, k(7)),
 Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
)
}}

myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
}.collect().foreach(println)

(Mykey(p1,lo1,8,0,4,0,5,20150901),1)

(Mykey(p1,lo1,8,0,4,0,5,20150901),1)
(Mykey(p1,lo3,8,0,4,0,5,20150901),1)
(Mykey(p1,lo3,8,0,4,0,5,20150901),1)
(Mykey(p1,lo4,8,0,4,0,5,20150901),1)
(Mykey(p1,lo4,8,0,4,0,5,20150901),1)
(Mykey(p1,lo2,8,0,4,0,5,20150901),1)
(Mykey(p1,lo2,8,0,4,0,5,20150901),1)



You can see that each key is repeated 2 times but each key should only
appear once.

Arun

On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu  wrote:

> Can you give a bit more information ?
>
> Release of Spark you're using
> Minimal dataset that shows the problem
>
> Cheers
>
> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra  wrote:
>
>> I tried groupByKey and noticed that it did not group all values into the
>> same group.
>>
>> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
>> distinct keys, so I expected there to be 4 records in the groupByKey
>> object, but instead there were 8. Each of the 4 distinct keys appear 2
>> times.
>>
>> Is this the expected behavior? I need to be able to get ALL values
>> associated with each key grouped into a SINGLE record. Is it possible?
>>
>> Arun
>>
>> p.s. reducebykey will not be sufficient for me
>>
>
>


Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
Could you try simplifying the key and seeing if that makes any difference?
Make it just a string or an int so we can rule out any issues in object
equality.

On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra  wrote:

> Spark 1.5.0
>
> data:
>
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>
> spark-shell:
>
> spark-shell \
> --num-executors 2 \
> --driver-memory 1g \
> --executor-memory 10g \
> --executor-cores 8 \
> --master yarn-client
>
>
> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
> f4:Char, f5:Char, f6:String)
> case class Myvalue(count1:Long, count2:Long, num:Double)
>
> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
> val spl = line.split("\\|", -1)
> val k = spl(0).split(",")
> val v = spl(1).split(",")
> (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
> k(5)(0).toChar, k(6)(0).toChar, k(7)),
>  Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
> )
> }}
>
> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
> }.collect().foreach(println)
>
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>
>
>
> You can see that each key is repeated 2 times but each key should only
> appear once.
>
> Arun
>
> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu  wrote:
>
>> Can you give a bit more information ?
>>
>> Release of Spark you're using
>> Minimal dataset that shows the problem
>>
>> Cheers
>>
>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra 
>> wrote:
>>
>>> I tried groupByKey and noticed that it did not group all values into the
>>> same group.
>>>
>>> In my test dataset (a Pair rdd) I have 16 records, where there are only
>>> 4 distinct keys, so I expected there to be 4 records in the groupByKey
>>> object, but instead there were 8. Each of the 4 distinct keys appear 2
>>> times.
>>>
>>> Is this the expected behavior? I need to be able to get ALL values
>>> associated with each key grouped into a SINGLE record. Is it possible?
>>>
>>> Arun
>>>
>>> p.s. reducebykey will not be sufficient for me
>>>
>>
>>
>


Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Don Drake
You will need to use the HDFS API to do that.

Try something like:

val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
fs.rename(new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"), new
org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt"))
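
If you need a copy rather than a move, FileUtil.copy is one option -- a sketch with
the same placeholder paths (the boolean argument is deleteSource, so false keeps the
original file):

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
FileUtil.copy(fs, new Path("/path/on/hdfs/file.txt"),
              fs, new Path("/path/on/hdfs/other/file.txt"),
              false, conf)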

Full API for FileSystem is here:
https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/fs/FileSystem.html

-Don


On Mon, Jan 4, 2016 at 9:07 PM, Zhiliang Zhu 
wrote:

>
> For some file on hdfs, it is necessary to copy/move it to some another
> specific hdfs  directory, and the directory name would keep unchanged.
> Just need finish it in spark program, but not hdfs commands.
> Is there any codes, it seems not to be done by searching spark doc ...
>
> Thanks in advance!
>



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Zhiliang Zhu

For some file on HDFS, it is necessary to copy/move it to another specific 
HDFS directory, and the directory name would stay unchanged. I just need to do 
it in a Spark program, not with HDFS commands. Is there any code for this? It 
does not seem to be covered by searching the Spark docs...

Thanks in advance! 

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
Hi Alexander,

That's cool! Thanks for the clarification.

Yanbo

2016-01-05 5:06 GMT+08:00 Ulanov, Alexander :

> Hi Yanbo,
>
>
>
> As long as two models fit into memory of a single machine, there should be
> no problems, so even 16GB machines can handle large models. (master should
> have more memory because it runs LBFGS) In my experiments, I’ve trained the
> models 12M and 32M parameters without issues.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Yanbo Liang [mailto:yblia...@gmail.com]
> *Sent:* Sunday, December 27, 2015 2:23 AM
> *To:* Joseph Bradley
> *Cc:* Eugene Morozov; user; d...@spark.apache.org
> *Subject:* Re: SparkML algos limitations question.
>
>
>
> Hi Eugene,
>
>
>
> AFAIK, the current implementation of MultilayerPerceptronClassifier have
> some scalability problems if the model is very huge (such as >10M),
> although I think the limitation can cover many use cases already.
>
>
>
> Yanbo
>
>
>
> 2015-12-16 6:00 GMT+08:00 Joseph Bradley :
>
> Hi Eugene,
>
>
>
> The maxDepth parameter exists because the implementation uses Integer node
> IDs which correspond to positions in the binary tree.  This simplified the
> implementation.  I'd like to eventually modify it to avoid depending on
> tree node IDs, but that is not yet on the roadmap.
>
>
>
> There is not an analogous limit for the GLMs you listed, but I'm not very
> familiar with the perceptron implementation.
>
>
>
> Joseph
>
>
>
> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
> Hello!
>
>
>
> I'm currently working on POC and try to use Random Forest (classification
> and regression). I also have to check SVM and Multiclass perceptron (other
> algos are less important at the moment). So far I've discovered that Random
> Forest has a limitation of maxDepth for trees and just out of curiosity I
> wonder why such a limitation has been introduced?
>
>
>
> An actual question is that I'm going to use Spark ML in production next
> year and would like to know if there are other limitations like maxDepth in
> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>
>
>
> Thanks in advance for your time.
>
> --
> Be well!
> Jean Morozov
>
>
>
>
>


Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
If I simplify the key to a String column with values lo1, lo2, lo3, lo4, it
works correctly.

On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman 
wrote:

> Could you try simplifying the key and seeing if that makes any difference?
> Make it just a string or an int so we can count out any issues in object
> equality.
>
> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra  wrote:
>
>> Spark 1.5.0
>>
>> data:
>>
>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>>
>> spark-shell:
>>
>> spark-shell \
>> --num-executors 2 \
>> --driver-memory 1g \
>> --executor-memory 10g \
>> --executor-cores 8 \
>> --master yarn-client
>>
>>
>> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
>> f4:Char, f5:Char, f6:String)
>> case class Myvalue(count1:Long, count2:Long, num:Double)
>>
>> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
>> val spl = line.split("\\|", -1)
>> val k = spl(0).split(",")
>> val v = spl(1).split(",")
>> (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
>> k(5)(0).toChar, k(6)(0).toChar, k(7)),
>>  Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
>> )
>> }}
>>
>> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
>> }.collect().foreach(println)
>>
>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>
>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>
>>
>>
>> You can see that each key is repeated 2 times but each key should only
>> appear once.
>>
>> Arun
>>
>> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu  wrote:
>>
>>> Can you give a bit more information ?
>>>
>>> Release of Spark you're using
>>> Minimal dataset that shows the problem
>>>
>>> Cheers
>>>
>>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra 
>>> wrote:
>>>
 I tried groupByKey and noticed that it did not group all values into
 the same group.

 In my test dataset (a Pair rdd) I have 16 records, where there are only
 4 distinct keys, so I expected there to be 4 records in the groupByKey
 object, but instead there were 8. Each of the 4 distinct keys appear 2
 times.

 Is this the expected behavior? I need to be able to get ALL values
 associated with each key grouped into a SINGLE record. Is it possible?

 Arun

 p.s. reducebykey will not be sufficient for me

>>>
>>>
>>


Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
That's interesting.

I would try

case class Mykey(uname:String)
case class Mykey(uname:String, c1:Char)
case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
f4:Char, f5:Char, f6:String)

In that order. It seems like there is some issue with equality between keys.
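
Another quick check along the same lines (a sketch only): rebuild the key as a plain
tuple, which gets structural equals/hashCode for free, and see whether the duplicate
groups disappear:

// If this yields 4 groups, the problem is in how Mykey instances compare
// (for example the case class being defined in the REPL), not in groupByKey itself.
val tupleKeyed = myrdd.map { case (k, v) =>
  ((k.uname, k.lo, k.f1, k.f2, k.f3, k.f4, k.f5, k.f6), v)
}
tupleKeyed.groupByKey().mapValues(_.size).collect().foreach(println)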

On Mon, Jan 4, 2016 at 5:05 PM Arun Luthra  wrote:

> If I simplify the key to String column with values lo1, lo2, lo3, lo4, it
> works correctly.
>
> On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman  > wrote:
>
>> Could you try simplifying the key and seeing if that makes any
>> difference? Make it just a string or an int so we can count out any issues
>> in object equality.
>>
>> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra  wrote:
>>
>>> Spark 1.5.0
>>>
>>> data:
>>>
>>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo3,8,0,4,0,5,20150901|5,1,1.0
>>> p1,lo4,8,0,4,0,5,20150901|5,1,1.0
>>>
>>> spark-shell:
>>>
>>> spark-shell \
>>> --num-executors 2 \
>>> --driver-memory 1g \
>>> --executor-memory 10g \
>>> --executor-cores 8 \
>>> --master yarn-client
>>>
>>>
>>> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
>>> f4:Char, f5:Char, f6:String)
>>> case class Myvalue(count1:Long, count2:Long, num:Double)
>>>
>>> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
>>> val spl = line.split("\\|", -1)
>>> val k = spl(0).split(",")
>>> val v = spl(1).split(",")
>>> (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
>>> k(5)(0).toChar, k(6)(0).toChar, k(7)),
>>>  Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
>>> )
>>> }}
>>>
>>> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
>>> }.collect().foreach(println)
>>>
>>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>>
>>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>>
>>>
>>>
>>> You can see that each key is repeated 2 times but each key should only
>>> appear once.
>>>
>>> Arun
>>>
>>> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu  wrote:
>>>
 Can you give a bit more information ?

 Release of Spark you're using
 Minimal dataset that shows the problem

 Cheers

 On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra 
 wrote:

> I tried groupByKey and noticed that it did not group all values into
> the same group.
>
> In my test dataset (a Pair rdd) I have 16 records, where there are
> only 4 distinct keys, so I expected there to be 4 records in the 
> groupByKey
> object, but instead there were 8. Each of the 4 distinct keys appear 2
> times.
>
> Is this the expected behavior? I need to be able to get ALL values
> associated with each key grouped into a SINGLE record. Is it possible?
>
> Arun
>
> p.s. reduceByKey will not be sufficient for me
>


>>>
>


Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
Hi Tomasz,

The limitation will not be changed, and you will find that all the models
reference the SparkContext in the new Spark ML package. It makes the Python
API simpler to implement.

But that does not mean you can only call this function on local data; you
can apply this function to an RDD, as in the following code snippet:

gmmModel.predictSoft(rdd)

then you will get a new RDD containing the soft prediction results. All
the models in the ML package follow this rule.

Yanbo

2016-01-04 22:16 GMT+08:00 Tomasz Fruboes :

> Hi Yanbo,
>
>  thanks for the info. Is it likely to change in the (near :) ) future? The ability to
> call this function only on local data (i.e. not in an RDD) seems to be a rather
> serious limitation.
>
>  cheers,
>   Tomasz
>
> On 02.01.2016 09:45, Yanbo Liang wrote:
>
>> Hi Tomasz,
>>
>> The GMM is bound to the peer Java GMM object, so it needs a reference to
>> SparkContext.
>> Some MLlib (not ML) models are simple objects, such as KMeansModel,
>> LinearRegressionModel etc., but others refer to SparkContext. The
>> latter ones and their member functions should not be called in map().
>>
>> Cheers
>> Yanbo
>>
>>
>>
>> 2016-01-01 4:12 GMT+08:00 Tomasz Fruboes > >:
>>
>> Dear All,
>>
>>   I'm trying to implement a procedure that iteratively updates a rdd
>> using results from GaussianMixtureModel.predictSoft. In order to
>> avoid problems with the local variable (the obtained GMM) being
>> overwritten in each pass of the loop I'm doing the following:
>>
>> ###
>> for i in xrange(10):
>>  gmm = GaussianMixture.train(rdd, 2)
>>
>>  def getSafePredictor(unsafeGMM):
>>  return lambda x: \
>>  (unsafeGMM.predictSoft(x.features),
>> unsafeGMM.gaussians.mu)
>>
>>
>>  safePredictor = getSafePredictor(gmm)
>>  predictionsRDD = (labelledpointrddselectedfeatsNansPatched
>>.map(safePredictor)
>>  )
>>  print predictionsRDD.take(1)
>>  (... - rest of code - update rdd with results from
>> predictionsRdd)
>> ###
>>
>> Unfortunately this ends with:
>>
>> ###
>> Exception: It appears that you are attempting to reference
>> SparkContext from a broadcast variable, action, or transformation.
>> SparkContext can only be used on the driver, not in code that it run
>> on workers. For more information, see SPARK-5063.
>> ###
>>
>> Any idea why I'm getting this behaviour? My expectation would be,
>> that GMM should be a "simple" object without SparkContext in it.
>> I'm using spark 1.5.2
>>
>>   Thanks,
>> Tomasz
>>
>>
>> ps As a workaround I'm doing currently
>>
>> 
>>  def getSafeGMM(unsafeGMM):
>>  return lambda x: unsafeGMM.predictSoft(x)
>>
>>  safeGMM = getSafeGMM(gmm)
>>  predictionsRDD = \
>>  safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))
>> 
>>   which works fine. If it's possible I would like to avoid this
>> approach, since it would require to perform another closure on
>> gmm.gaussians later in my code
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>>
>>
>>
>


Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread ayan guha
My guess is No, unless you are okay to read the data and write it back
again.
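
If re-reading and re-writing is acceptable, a minimal sketch (with hypothetical
paths, assuming a plain text file) would be:

// "Copy" by reading through Spark and writing back out.
// Note: saveAsTextFile creates a directory of part files under dst,
// not a byte-for-byte copy of the original file.
val src = "/user/someone/input/data.txt"   // hypothetical source path
val dst = "/user/someone/archive/data"     // hypothetical destination
sc.textFile(src).saveAsTextFile(dst)

For a true move without rewriting the data, the Hadoop FileSystem API (e.g.
FileSystem.rename) can be called from the driver, but that is the HDFS client
API rather than a Spark operation.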

On Tue, Jan 5, 2016 at 2:07 PM, Zhiliang Zhu 
wrote:

>
> For some file on HDFS, it is necessary to copy/move it to another
> specific HDFS directory, and the directory name should stay unchanged.
> This needs to be done in a Spark program, not with HDFS commands.
> Is there any code for this? It does not seem to be covered in the Spark docs ...
>
> Thanks in advance!
>



-- 
Best Regards,
Ayan Guha


problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-04 Thread Andy Davidson
I am having a heck of a time writing a simple transformer in Java. I assume
that my Transformer is supposed to append a new column to the dataFrame
argument. Any idea why I get the following exception in Java 8 when I try to
call DataFrame withColumn()? The JavaDoc says withColumn() "Returns a new
DataFrame by adding a column or replacing the existing column that has the
same name."


Also do transformers always run in the driver? If not I assume workers do
not have the sqlContext. Any idea how I can convert a JavaRDD<> to a Column
without a sqlContext?

Kind regards

Andy

P.s. I am using spark 1.6.0

org.apache.spark.sql.AnalysisException: resolved attribute(s)
filteredOutput#1 missing from rawInput#0 in operator !Project
[rawInput#0,filteredOutput#1 AS filteredOutput#2];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1227)
at com.pws.poc.ml.StemmerTransformer.transform(StemmerTransformer.java:110)
at 
com.pws.poc.ml.StemmerTransformerTest.test(StemmerTransformerTest.java:45)



public class StemmerTransformer extends Transformer implements Serializable
{

   String inputCol; // unit test sets to rawInput
   String outputCol; // unit test sets to filteredOutput

  …


  public StemmerTransformer(SQLContext sqlContext) {

// will only work if transformers execute in the driver

this.sqlContext = sqlContext;

}


 @Override

public DataFrame transform(DataFrame df) {

df.printSchema();

df.show();



JavaRDD inRowRDD = df.select(inputCol).javaRDD();

JavaRDD outRowRDD = inRowRDD.map((Row row) -> {

// TODO add stemming code

// Create a new Row

Row ret = RowFactory.create("TODO");

return ret;

});



//can we create a Col from a JavaRDD?



List fields = new ArrayList();

boolean nullable = true;

fields.add(DataTypes.createStructField(outputCol,
DataTypes.StringType, nullable));



StructType schema =  DataTypes.createStructType(fields);

DataFrame outputDF = sqlContext.createDataFrame(outRowRDD, schema);

outputDF.printSchema();

outputDF.show();

Column newCol = outputDF.col(outputCol);



return df.withColumn(outputCol, newCol);

}



SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type
[ch.qos.logback.classic.util.ContextSelectorStaticBinder]

WARN  03:58:46 main o.a.h.u.NativeCodeLoader  line:62 Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable

root

 |-- rawInput: array (nullable = false)

 ||-- element: string (containsNull = true)



++

|rawInput|

++

|[I, saw, the, red...|

|[Mary, had, a, li...|

|[greet, greeting,...|

++



root

 |-- filteredOutput: string (nullable = true)



+--+

|filteredOutput|

+--+

|  TODO|

|  TODO|

|  TODO|

+--+






Re: stopping a process usgin an RDD

2016-01-04 Thread Michael Segel
Not really a good idea. 

It breaks the paradigm. 

If I understand the OP’s idea: they want to halt processing of the RDD, but not 
the entire job. When a certain condition is hit, the task stops, yet processing continues on to 
the next RDD (assuming you have more RDDs or partitions than you have task 
’slots’). So if you fail enough RDDs, your job fails, meaning you don’t get any 
results. 

The best you could do is a NOOP.  That is… if your condition is met on that 
RDD, your M/R job will not output anything to the collection so no more data is 
being added to the result set. 
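
For illustration, a minimal sketch of that NOOP idea (with a hypothetical 
shouldSkip predicate over an RDD of strings) might look like: 

// Records that meet the stop condition simply contribute nothing to the
// output; everything else is processed as usual.
def shouldSkip(record: String): Boolean = record.startsWith("#") // hypothetical condition

val results = rdd.flatMap { record =>
  if (shouldSkip(record)) Iterator.empty else Iterator(record.length)
}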

The whole paradigm is to process the entire RDD at a time. 

You may spin cycles, but that’s not really a bad thing. 

HTH

-Mike

> On Jan 4, 2016, at 6:45 AM, Daniel Darabos  
> wrote:
> 
> You can cause a failure by throwing an exception in the code running on the 
> executors. The task will be retried (if spark.task.maxFailures > 1), and then 
> the stage is failed. No further tasks are processed after that, and an 
> exception is thrown on the driver. You could catch the exception and see if 
> it was caused by your own special exception.
> 
> On Mon, Jan 4, 2016 at 1:05 PM, domibd  > wrote:
> Hello,
> 
> Is there a way to stop under a condition a process (like map-reduce) using
> an RDD ?
> 
> (this could be use if the process does not always need to
>  explore all the RDD)
> 
> thanks
> 
> Dominique
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/stopping-a-process-usgin-an-RDD-tp25870.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: stopping a process usgin an RDD

2016-01-04 Thread Daniel Darabos
You can cause a failure by throwing an exception in the code running on the
executors. The task will be retried (if spark.task.maxFailures > 1), and
then the stage is failed. No further tasks are processed after that, and an
exception is thrown on the driver. You could catch the exception and see if
it was caused by your own special exception.
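
A minimal sketch of that pattern (with a hypothetical StopProcessingException
and an RDD of strings) could look like:

import org.apache.spark.SparkException

// Marker exception thrown from the executor-side closure when the stop
// condition is met.
class StopProcessingException(msg: String) extends RuntimeException(msg)

try {
  rdd.foreach { record =>
    if (record.contains("STOP"))   // hypothetical stop condition
      throw new StopProcessingException("stop condition met")
    // ... normal per-record work ...
  }
} catch {
  case e: SparkException =>
    // Depending on the Spark version, the original exception surfaces as the
    // cause and/or inside the message of the driver-side SparkException.
    if (e.getMessage.contains("StopProcessingException")) println("Stopped early")
    else throw e
}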

On Mon, Jan 4, 2016 at 1:05 PM, domibd  wrote:

> Hello,
>
> Is there a way to stop under a condition a process (like map-reduce) using
> an RDD ?
>
> (this could be use if the process does not always need to
>  explore all the RDD)
>
> thanks
>
> Dominique
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/stopping-a-process-usgin-an-RDD-tp25870.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Tomasz Fruboes

Hi Yanbo,

 thanks for the info. Is it likely to change in the (near :) ) future? The ability 
to call this function only on local data (i.e. not in an RDD) seems to be 
a rather serious limitation.


 cheers,
  Tomasz

On 02.01.2016 09:45, Yanbo Liang wrote:

Hi Tomasz,

The GMM is bound to the peer Java GMM object, so it needs a reference to
SparkContext.
Some MLlib (not ML) models are simple objects, such as KMeansModel,
LinearRegressionModel etc., but others refer to SparkContext. The
latter ones and their member functions should not be called in map().

Cheers
Yanbo



2016-01-01 4:12 GMT+08:00 Tomasz Fruboes >:

Dear All,

  I'm trying to implement a procedure that iteratively updates a rdd
using results from GaussianMixtureModel.predictSoft. In order to
avoid problems with the local variable (the obtained GMM) being
overwritten in each pass of the loop I'm doing the following:

###
for i in xrange(10):
 gmm = GaussianMixture.train(rdd, 2)

 def getSafePredictor(unsafeGMM):
 return lambda x: \
 (unsafeGMM.predictSoft(x.features),
unsafeGMM.gaussians.mu)

 safePredictor = getSafePredictor(gmm)
 predictionsRDD = (labelledpointrddselectedfeatsNansPatched
   .map(safePredictor)
 )
 print predictionsRDD.take(1)
 (... - rest of code - update rdd with results from predictionsRdd)
###

Unfortunately this ends with:

###
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
###

Any idea why I'm getting this behaviour? My expectation would be,
that GMM should be a "simple" object without SparkContext in it.
I'm using spark 1.5.2

  Thanks,
Tomasz


ps As a workaround I'm doing currently


 def getSafeGMM(unsafeGMM):
 return lambda x: unsafeGMM.predictSoft(x)

 safeGMM = getSafeGMM(gmm)
 predictionsRDD = \
 safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))

  which works fine. If it's possible I would like to avoid this
approach, since it would require to perform another closure on
gmm.gaussians later in my code


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Trying to run GraphX ConnectedComponents for large data with out success

2016-01-04 Thread Dagan, Arnon
While trying to run a Spark job with Spark 1.5.1, using the following parameters:
--master "yarn"
--deploy-mode "cluster"
--num-executors 200
 --driver-memory 14G
--executor-memory 14G
--executor-cores 1

Trying to run GraphX ConnectedComponents on large data (~4TB) using the 
following commands:

System.setProperty("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
val edges = ...
val graph = Graph.fromEdgeTuples(edges,0,edgeStorageLevel = 
StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
val components = graph.connectedComponents().vertices
Some of the tasks complete successfully, and some fail with the following 
errors:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 2
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
at 
org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
at 
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at 
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

and another error:
org.apache.spark.shuffle.FetchFailedException: Connection from 
phxaishdc9dn1209.stratus.phx.ebay.com/10.115.60.32:40099 closed
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:321)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:306)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 

unsubscribe

2016-01-04 Thread Irvin
unsubscribe


-- 
Thanks & Best Regards

[discuss] dropping Python 2.6 support

2016-01-04 Thread Reynold Xin
Does anybody here care about us dropping support for Python 2.6 in Spark
2.0?

Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
parsing) when compared with Python 2.7. Some libraries that Spark depend on
stopped supporting 2.6. We can still convince the library maintainers to
support 2.6, but it will be extra work. I'm curious if anybody still uses
Python 2.6 to run Spark.

Thanks.


Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-04 Thread Michael Armbrust
It's not really possible to convert an RDD to a Column.  You can think of a
Column as an expression that produces a single output given some set of
input columns.  If I understand your code correctly, I think this might be
easier to express as a UDF:

sqlContext.udf().register("stem", new UDF1() {
  @Override
  public String call(String str) {
return // TODO: stemming code here
  }
}, DataTypes.StringType);

DataFrame transformed = df.withColumn("filteredInput", expr("stem(rawInput)"));


On Mon, Jan 4, 2016 at 8:08 PM, Andy Davidson  wrote:

> I am having a heck of a time writing a simple transformer in Java. I
> assume that my Transformer is supposed to append a new column to the
> dataFrame argument. Any idea why I get the following exception in Java 8
> when I try to call DataFrame withColumn()? The JavaDoc says withColumn() 
> "Returns
> a new DataFrame
> 
>  by adding a column or replacing the existing column that has the same
> name.”
>
>
> Also do transformers always run in the driver? If not I assume workers do
> not have the sqlContext. Any idea how I can convert an javaRDD<> to a
> Column with out a sqlContext?
>
> Kind regards
>
> Andy
>
> P.s. I am using spark 1.6.0
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s)
> filteredOutput#1 missing from rawInput#0 in operator !Project
> [rawInput#0,filteredOutput#1 AS filteredOutput#2];
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame.org
> $apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1227)
> at com.pws.poc.ml.StemmerTransformer.transform(StemmerTransformer.java:110)
> at
> com.pws.poc.ml.StemmerTransformerTest.test(StemmerTransformerTest.java:45)
>
>
>
> public class StemmerTransformer extends Transformer implements
> Serializable {
>String inputCol; // unit test sets to rawInput
>
>String outputCol; // unit test sets to filteredOutput
>
>   …
>
>
>   public StemmerTransformer(SQLContext sqlContext) {
>
> // will only work if transformers execute in the driver
>
> this.sqlContext = sqlContext;
>
> }
>
>
>  @Override
>
> public DataFrame transform(DataFrame df) {
>
> df.printSchema();
>
> df.show();
>
>
>
> JavaRDD inRowRDD = df.select(inputCol).javaRDD();
>
> JavaRDD outRowRDD = inRowRDD.map((Row row) -> {
>
> // TODO add stemming code
>
> // Create a new Row
>
> Row ret = RowFactory.create("TODO");
>
> return ret;
>
> });
>
>
>
> //can we create a Col from a JavaRDD?
>
>
>
> List fields = new ArrayList();
>
> boolean nullable = true;
>
> fields.add(DataTypes.createStructField(outputCol, DataTypes.
> StringType, nullable));
>
>
> StructType schema =  DataTypes.createStructType(fields);
>
> DataFrame outputDF = sqlContext.createDataFrame(outRowRDD, schema
> );
>
> outputDF.printSchema();
>
> outputDF.show();
>
> Column newCol = outputDF.col(outputCol);
>
>
>
> return df.withColumn(outputCol, newCol);
>
> }
>
>
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
>
> SLF4J: Actual binding is of type
> [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
>
> WARN  03:58:46 main o.a.h.u.NativeCodeLoader  line:62 Unable to
> load native-hadoop library for your platform... using builtin-java classes
> where applicable
>
> root
>
>  |-- rawInput: array (nullable = false)
>
>  ||-- element: string (containsNull = true)
>
>
> ++
>
> |rawInput|
>
> ++
>
> |[I, saw, the, red...|
>
> |[Mary, had, a, li...|
>
> |[greet, greeting,...|
>
> ++
>
>
> root
>
>  |-- filteredOutput: string (nullable = true)
>
>
> +--+
>
> |filteredOutput|
>
> +--+
>
> |  TODO|
>
> |  TODO|
>
> |  TODO|
>
> +--+
>
>
>

Security authentication interface for Spark

2016-01-04 Thread jiehua
Hi  All,

We are using Spark 1.4.1/1.5.2 in standalone mode and would like to add 3rd
party user authentication for Spark. We found that for batch submission (cluster
mode, but not RESTful), there is Akka authentication (a security cookie that must
be identical on both sides) when the client connects to the master.
However, we found that if we wanted to add username/password or credential
authentication, we would have to change the Akka source code.

 Is there any way on the Spark side to support this? Will Spark expose any
security authentication interfaces to enable a 3rd party security plugin?

Thanks.
 
 
Jie Hua 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Security-authentication-interface-for-Spark-tp25879.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[ANNOUNCE] Announcing Spark 1.6.0

2016-01-04 Thread Michael Armbrust
Hi All,

Spark 1.6.0 is the seventh release on the 1.x line. This release includes
patches from 248+ contributors! To download Spark 1.6.0 visit the downloads
page.  (It may take a while for all mirrors to update.)

A huge thanks go to all of the individuals and organizations involved in
development and testing of this release.

Visit the release notes [1] to read about the new features, or download [2]
the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-6-0.html
[2] http://spark.apache.org/downloads.html


Re: Is Spark 1.6 released?

2016-01-04 Thread Jean-Baptiste Onofré

It's now OK: Michael published and announced the release.

Sorry for the delay.

Regards
JB

On 01/04/2016 10:06 AM, Jung wrote:

Hi
There were Spark 1.6 jars in Maven Central and on GitHub.
I found them 5 days ago, but it doesn't appear on the Spark website now.
May I regard the Spark 1.6 zip file on GitHub as a stable release?

Thanks
Jung



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: email not showing up on the mailing list

2016-01-04 Thread Mattmann, Chris A (3980)
Moving user-owner to BCC.

Hi Daniel, please:

1. send an email to user-subscr...@spark.apache.org. Wait for an
automated reply that should let you know how to finish subscribing.
2. once done, post email to user@spark.apache.org from your email
that you subscribed with in 1 and it should work fine.

Thanks!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Daniel Imberman 
Date: Monday, January 4, 2016 at 11:53 AM
To: "user-ow...@spark.apache.org" 
Subject: email not showing up on the mailing list

>Hi,
>
>
>I'm sorry to bother but I'm unfortunately having trouble posting to the
>boards. An hour ago I send an email to
>user@spark.apache.org and even though other peoples emails are showing up
>I cant see mine anywhere. Is there some sort of permission I need to get
>before I can post to the mailing list?
>
>
>Thank you for your help,
>
>
>Daniel
>


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.4 RDD to DF fails with toDF()

2016-01-04 Thread Fab
Good catch, thanks. Things work now after changing the version.

For reference, I got the "2.11" version from my separate download of Scala:
$ scala
Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_67).

But my Spark version is indeed running Scala "2.10":
$ spark-shell
16/01/04 12:47:18 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_67)

After changing my sbt file, I could finally submit my Scala code to Spark.
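
For reference, a minimal sketch of the relevant build.sbt settings (artifact
versions assumed to match a Spark 1.5.2 / Scala 2.10 cluster) is:

// Match the Scala version of the Spark distribution (2.10.x here).
scalaVersion := "2.10.4"

// %% appends the Scala binary version (_2.10) to the artifact name,
// so it must agree with the cluster's Spark build.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.5.2" % "provided"
)
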
Thanks!


On Mon, Jan 4, 2016 at 12:26 PM, nraychaudhuri [via Apache Spark User List]
 wrote:

> Is the Spark distribution you are using built with scala 2.11? I think the
> default one is built using scala 2.10.
>
> Nilanjan
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-RDD-to-DF-fails-with-toDF-tp23499p25874.html
> 
>



-- 


*Fabrice Guillaume*
VP of Engineering
CircleBack, Inc.

Office   +1 (703) 520- x755
Cell  +1 (703) 928-1851
Email   fabr...@circleback.com
Download CircleBack for iOS  & Android





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-RDD-to-DF-fails-with-toDF-tp23499p25875.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

subscribe

2016-01-04 Thread Suresh Thalamati



Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Mike Wright
Has anyone used Spark Job Server on a "kerberized" cluster in YARN-Client
mode? When Job Server contacts the YARN resource manager, we see a "Cannot
impersonate root" error and am not sure what we have misconfigured.

Thanks.

___

*Mike Wright*
Principal Architect, Software Engineering
S Capital IQ and SNL

434-951-7816 *p*
434-244-4466 *f*
540-470-0119 *m*

mwri...@snl.com