Re: How can a deserialized Java object be stored on disk?

2014-08-31 Thread Sean Owen
Yes, there is no such thing as writing a deserialized form to disk. There
are, however, other persistence levels that store *serialized* forms in
memory. "Deserialized" here refers only to how the objects are held in
memory in the JVM; whatever spills to disk is, of course, written in
serialized form.
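
For illustration, here is a minimal Java sketch of the two flavors of
persistence level being discussed (the input path is hypothetical):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistenceLevels {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "persistence-levels");

    // MEMORY_AND_DISK keeps plain (deserialized) objects in memory;
    // any partitions that spill to disk are written in serialized form.
    JavaRDD<String> deserializedInMemory = sc.textFile("/tmp/input.txt")   // hypothetical path
        .persist(StorageLevel.MEMORY_AND_DISK());

    // MEMORY_AND_DISK_SER keeps serialized bytes in memory as well,
    // trading CPU for a smaller memory footprint.
    JavaRDD<String> serializedInMemory = sc.textFile("/tmp/input.txt")
        .persist(StorageLevel.MEMORY_AND_DISK_SER());

    System.out.println(deserializedInMemory.count() + " / " + serializedInMemory.count());
    sc.stop();
  }
}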

On Sun, Aug 31, 2014 at 5:02 AM, Tao Xiao xiaotao.cs@gmail.com wrote:
 Reading about RDD persistence, I learned that the storage level
 MEMORY_AND_DISK means "Store RDD as deserialized Java objects in the
 JVM. If the RDD does not fit in memory, store the partitions that don't fit
 on disk, and read them from there when they're needed."

 But how can a deserialized Java object be stored on disk? As far as I
 know, a Java object should be stored as an array of bytes on disk, which
 means that the Java object must first be converted into an array of bytes (a
 serialized object).

 Thanks .

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



jdbcRDD from JAVA

2014-08-31 Thread Ahmad Osama
hi,

is there a simple example for JdbcRDD from Java and not Scala?

I am trying to figure out the last parameter in the constructor of JdbcRDD.

thanks


Re: jdbcRDD from JAVA

2014-08-31 Thread Sean Owen
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/JdbcRDD.html#JdbcRDD(org.apache.spark.SparkContext,
scala.Function0, java.lang.String, long, long, int, scala.Function1,
scala.reflect.ClassTag)

I don't think there is a completely Java-friendly version of this
class. However, you should be able to get away with passing something
generic like ClassTag$.MODULE$.apply(Object.class). There's
probably something even simpler.
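
For what that can look like in practice, here is a rough, untested Java
sketch; the JDBC URL, table and query are hypothetical, and the small helper
classes implement Serializable because Spark ships the connection and
row-mapping functions to the executors:

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkContext;
import org.apache.spark.rdd.JdbcRDD;

import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class JdbcRddFromJava {

  // () => Connection for the second constructor argument
  static class ConnectionFactory extends AbstractFunction0<Connection> implements Serializable {
    @Override public Connection apply() {
      try {
        return DriverManager.getConnection("jdbc:mysql://localhost/test"); // hypothetical URL
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }

  // ResultSet => T for the row-mapping argument; here T is Object[]
  static class RowMapper extends AbstractFunction1<ResultSet, Object[]> implements Serializable {
    @Override public Object[] apply(ResultSet rs) {
      try {
        return new Object[] { rs.getLong(1), rs.getString(2) };
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }

  public static void main(String[] args) {
    SparkContext sc = new SparkContext("local[2]", "jdbcrdd-from-java");

    // The trailing constructor parameter is the ClassTag for the element type.
    ClassTag<Object[]> tag = ClassTag$.MODULE$.apply(Object[].class);

    JdbcRDD<Object[]> rows = new JdbcRDD<Object[]>(
        sc,
        new ConnectionFactory(),
        "SELECT id, name FROM people WHERE id >= ? AND id <= ?", // hypothetical table/query
        1L, 1000L, 3,                                            // lower bound, upper bound, partitions
        new RowMapper(),
        tag);

    System.out.println(rows.count());
    sc.stop();
  }
}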

On Sun, Aug 31, 2014 at 3:07 PM, Ahmad Osama aos...@gmail.com wrote:
 hi,

 is there a simple example for jdbcRDD from JAVA and not scala,

 trying to figure out the last parameter in the constructor of jdbcRDD

 thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What does appMasterRpcPort: -1 indicate ?

2014-08-31 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it.

Following "How-to: Run a Simple Apache Spark App in CDH 5", I tried to
submit my job in local mode, Spark Standalone mode and YARN mode. I
successfully submitted my job in local mode and Standalone mode; however, I
noticed the following messages printed on the console when I submitted my job
in YARN mode:


14/08/29 22:27:29 INFO Client: Submitting application to ASM

14/08/29 22:27:29 INFO YarnClientImpl: Submitted application
application_1406949333981_0015

14/08/29 22:27:29 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:30 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:31 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:32 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:33 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:34 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:35 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:36 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:37 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:38 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: -1

  appStartTime: 1409365649836

  yarnAppState: ACCEPTED


14/08/29 22:27:39 INFO YarnClientSchedulerBackend: Application report from
ASM:

  appMasterRpcPort: 0

  appStartTime: 1409365649836

  yarnAppState: RUNNING


The job finished successfully and produced correct results,
but I'm not sure what those messages mean. Does appMasterRpcPort: -1
indicate an error or an exception?


Re: What does appMasterRpcPort: -1 indicate ?

2014-08-31 Thread Yi Tian
I think -1 means your application master has not been started yet. 


 On Aug 31, 2014, at 23:02, Tao Xiao xiaotao.cs@gmail.com wrote:
 
 I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it.
 
 Following How-to: Run a Simple Apache Spark App in CDH 5 , I tried to submit 
 my job in local mode, Spark Standalone mode and YARN mode. I successfully 
 submitted my job in local mode and Standalone mode, however, I noticed the 
 following messages printed on console when I submitted my job in YARN mode:
 
 
 14/08/29 22:27:29 INFO Client: Submitting application to ASM
 
 14/08/29 22:27:29 INFO YarnClientImpl: Submitted application 
 application_1406949333981_0015
 
 14/08/29 22:27:29 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:30 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:31 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:32 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:33 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:34 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:35 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:36 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:37 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:38 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: -1
 
   appStartTime: 1409365649836
 
   yarnAppState: ACCEPTED
 
 
 
 14/08/29 22:27:39 INFO YarnClientSchedulerBackend: Application report from 
 ASM:
 
   appMasterRpcPort: 0
 
   appStartTime: 1409365649836
 
   yarnAppState: RUNNING
 
 
 
 The job finished successfully and produced correct results.
 But I'm not sure what those messages mean? Does appMasterRpcPort: -1 
 indicate an error or exception ?
 
 


Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Steve Lewis
Is there a sample of how to do this?
I see 1.1 is out but cannot find samples of mapPartitions.
A Java sample would be very useful.


On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 In 1.1, you'll be able to get all of these properties using sortByKey, and
 then mapPartitions on top to iterate through the key-value pairs.
 Unfortunately sortByKey does not let you control the Partitioner, but it's
 fairly easy to write your own version that does if this is important.

 In previous versions, the values for each key had to fit in memory (though
 we could have data on disk across keys), and this is still true for
 groupByKey, cogroup and join. Those restrictions will hopefully go away in
 a later release. But sortByKey + mapPartitions lets you just iterate
 through the key-value pairs without worrying about this.

 Matei

 On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com)
 wrote:

  When programming in Hadoop it is possible to guarantee
 1) All keys sent to a specific partition will be handled by the same
 machine (thread)
 2) All keys received by a specific machine (thread) will be received in
 sorted order
 3) These conditions will hold even if the values associated with a
 specific key are too large to fit in memory.

 In my Hadoop code I use all of these conditions - specifically with my
 larger data sets the size of data I wish to group exceeds the available
 memory.

 I think I understand the operation of groupby but my understanding is that
 this requires that the results for a single key, and perhaps all keys fit
 on a single machine.

 Is there a way to perform this like Hadoop and not require that an entire group
 fit in memory?




-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Koert Kuipers
matei,
it is good to hear that the restriction that keys need to fit in memory no
longer applies to combineByKey. however, join requiring keys to fit in
memory is still a big deal to me. does it apply to both sides of the join,
or only one (while the other side is streaming)?


On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 In 1.1, you'll be able to get all of these properties using sortByKey, and
 then mapPartitions on top to iterate through the key-value pairs.
 Unfortunately sortByKey does not let you control the Partitioner, but it's
 fairly easy to write your own version that does if this is important.

 In previous versions, the values for each key had to fit in memory (though
 we could have data on disk across keys), and this is still true for
 groupByKey, cogroup and join. Those restrictions will hopefully go away in
 a later release. But sortByKey + mapPartitions lets you just iterate
 through the key-value pairs without worrying about this.

 Matei

 On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com)
 wrote:

  When programming in Hadoop it is possible to guarantee
 1) All keys sent to a specific partition will be handled by the same
 machine (thread)
 2) All keys received by a specific machine (thread) will be received in
 sorted order
 3) These conditions will hold even if the values associated with a
 specific key are too large to fit in memory.

 In my Hadoop code I use all of these conditions - specifically with my
 larger data sets the size of data I wish to group exceeds the available
 memory.

 I think I understand the operation of groupby but my understanding is that
 this requires that the results for a single key, and perhaps all keys fit
 on a single machine.

 Is there a way to perform this like Hadoop and not require that an entire group
 fit in memory?




Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-31 Thread RodrigoB
Hi Yana,

You are correct. What needs to be added is that besides RDDs being
checkpointed, the metadata that represents the execution of computations is
also checkpointed in Spark Streaming.

Upon driver recovery, the last batches (the ones already executed and the
ones that should have been executed while the driver was down) are recomputed.
This is very good if we just want to recover state and if we don't have any
other component or data store depending on Spark's output.
In the case where we do have that requirement (which is my case), all the nodes
will re-execute all of that IO, provoking overall system inconsistency, as the
outside systems were not expecting events from the past.

We need some way of making Spark aware of which computations are
recomputations and which are not, so we can empower Spark developers to
introduce specific logic if they need to.

Let me know if any of this doesn't make sense.

tnks,
Rod 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13205.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



This always tries to connect to HDFS: user$ export MASTER=local[NN]; pyspark --master local[NN] ...

2014-08-31 Thread didata
Hello friends:

I use the Cloudera/CDH5 version of Spark (v1.0.0 Spark RPMs), but the
following is also true when
using the Apache Spark distribution built against a locally installed
Hadoop/YARN installation.

The problem:

If the directory /etc/hadoop/conf/ exists and the pertinent '*.xml' files
within it configure HDFS to use a host, say namenode, as the HDFS namenode,
then no matter how I locally invoke pyspark on the command line, it always
tries to connect to namenode, which I don't always want because I don't
always have HDFS running.

In other words, the following always hits an exception when it cannot
connect to HDFS:

user$ export MASTER=local[NN]; pyspark --master local[NN]

The only workaround I've found is the following, which is not good at all:

user$ (cd /etc/hadoop; sudo mv conf _conf); export MASTER=local[NN]; pyspark --master local[NN]

Without temporarily moving the Hadoop/YARN configuration directory, how do I
dynamically instruct pyspark on the CLI not to use HDFS? (i.e. without
hard-coding anything anywhere, such as in /etc/spark/spark-env.sh)

Thank you in advance!
didata staff



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/This-always-tries-to-connect-to-HDFS-user-export-MASTER-local-NN-pyspark-master-local-NN-tp13207.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Low Level Kafka Consumer for Spark

2014-08-31 Thread RodrigoB
Just a comment on the recovery part. 

Is it correct to say that the current Spark Streaming recovery design does not
consider re-computations (upon metadata lineage recovery) that depend on
blocks of data of the received stream?
https://issues.apache.org/jira/browse/SPARK-1647

Just to illustrate a real use case (mine):
- We have object states with a Duration field per state which is
incremented on every batch interval. This object state is also reset to 0
upon incoming state-changing events. Let's suppose there is at least one
event since the last data checkpoint. This will lead to inconsistency upon
driver recovery: the Duration field will get incremented from the data
checkpoint version until the recovery moment, but the state change event
will never be re-processed...so in the end we have the old state with the
wrong Duration value.
To make things worse, let's imagine we're dumping the Duration increases
somewhere...which means we're spreading the problem across our system.
Re-computation awareness is something I've commented on in another thread and
would rather treat separately.
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-td12568.html#a13205

Re-computations do occur, but the only RDDs that are recovered are the ones
from the data checkpoint. This is what we've seen. That is not enough by itself
to ensure recovery of computed data, and this partial recovery leads to
inconsistency in some cases.

Roger - I share the same question with you - I'm just not sure whether the
replicated data really gets persisted on every batch. The execution lineage
is checkpointed, but if we have big chunks of data being consumed by the
Receiver node on, say, a per-second basis, then persisting it to HDFS
every second could be a big challenge for keeping JVM performance - maybe
that could be the reason why it's not really implemented...assuming it isn't.

Dibyendu made a great effort with the offset-controlling code, but general
state-consistent recovery feels to me like another big issue to address.

I plan on diving into the Streaming code and trying to at least
contribute some ideas. More insight from anyone on the dev team
would be very appreciated.

tnks,
Rod 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p13208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: This always tries to connect to HDFS: user$ export MASTER=local[NN]; pyspark --master local[NN] ...

2014-08-31 Thread Sean Owen
I think you're saying it's looking for /foo on HDFS and not on your
local file system?

If so, I would suggest either prefixing your local paths with file:
to be unambiguous, or unsetting HADOOP_HOME and HADOOP_CONF_DIR.
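
As a hedged illustration of the file: prefix (in Java here, though the same
idea applies to pyspark; the paths are made up):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalVsDefaultFs {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "local-vs-defaultfs");

    // Unambiguous: always read from the local filesystem,
    // even when /etc/hadoop/conf points fs.defaultFS at an HDFS namenode.
    JavaRDD<String> local = sc.textFile("file:///tmp/data.txt");   // hypothetical path

    // Without a scheme, the path is resolved against fs.defaultFS
    // from the Hadoop configuration on the classpath (HDFS in this case).
    JavaRDD<String> viaDefaultFs = sc.textFile("/tmp/data.txt");

    System.out.println(local.count() + " / " + viaDefaultFs.count());
    sc.stop();
  }
}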

On Sun, Aug 31, 2014 at 10:17 PM, didata subscripti...@didata.us wrote:
 Hello friends:

 I use the Cloudera/CDH5 version of Spark (v1.0.0 Spark RPMs), but the
 following is also true when
 using the Apache Spark distribution built against a locally installed
 Hadoop/YARN installation.

 The problem:

 If the following directory exists, */etc/hadoop/conf/*, and the pertinent
 '*.xml' files within it for
 *HDFS* are configured to use host, say, /*namenode*/ as the HDFS namenode,
 then no
 matter how I *locally* invoke pyspark on the command line, it always tries
 to connect to */namenode/*,
 which I don't always want because I don't always have HDFS running.

 In other words, the following always experiences an exception when it cannot
 connect to HDFS:

 user$ *export MASTER=local[NN]; pyspark --master local[NN]*

 The only work-around I've found to this, is to do the following, which is
 not good at all:

 user$ *(cd /etc/hadoop; sudo mv conf _conf); export MASTER=local[NN];
 pyspark --master local[NN]*

 Without temporarily moving the Hadoop/YARN configuration directory, how do I
 dynamically instruct
 pyspark on the CLI not to use HDFS? (i.e. without hard-coding anything anywhere, such
 as in
 */etc/spark/spark-env.sh*)

 Thank you in advance!
 didata staff



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/This-always-tries-to-connect-to-HDFS-user-export-MASTER-local-NN-pyspark-master-local-NN-tp13207.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
Just to be clear, no operation requires all the keys to fit in memory, only the
values for each specific key. All the values for each individual key need to
fit, but the system can spill to disk across keys. Right now this applies to both
sides of a join, unless you do a broadcast join by hand with something like
mapPartitions.
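
For reference, a rough Java sketch of what a broadcast join by hand with
mapPartitions can look like (the data and names are invented; only the small
side has to fit in memory, and it is shipped to the executors once as a
broadcast variable):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class BroadcastJoinByHand {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "broadcast-join");

    // Small side: fits in memory, broadcast to every executor.
    Map<String, String> countries = new HashMap<String, String>();
    countries.put("us", "United States");
    countries.put("nl", "Netherlands");
    final Broadcast<Map<String, String>> small = sc.broadcast(countries);

    // Large side: streamed through partition by partition, never collected.
    JavaPairRDD<String, Integer> visits = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("us", 3),
        new Tuple2<String, Integer>("nl", 1),
        new Tuple2<String, Integer>("xx", 7)));

    JavaRDD<String> joined = visits.mapPartitions(
        new FlatMapFunction<Iterator<Tuple2<String, Integer>>, String>() {
          @Override
          public Iterable<String> call(Iterator<Tuple2<String, Integer>> it) {
            Map<String, String> lookup = small.value();
            List<String> out = new ArrayList<String>();
            while (it.hasNext()) {
              Tuple2<String, Integer> kv = it.next();
              String name = lookup.get(kv._1());
              if (name != null) {                 // inner-join semantics
                out.add(name + " -> " + kv._2());
              }
            }
            return out;
          }
        });

    System.out.println(joined.collect());
    sc.stop();
  }
}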

Matei

On August 31, 2014 at 12:44:26 PM, Koert Kuipers (ko...@tresata.com) wrote:

matei,
it is good to hear that the restriction that keys need to fit in memory no
longer applies to combineByKey. however, join requiring keys to fit in memory is
still a big deal to me. does it apply to both sides of the join, or only one
(while the other side is streaming)?


On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
In 1.1, you'll be able to get all of these properties using sortByKey, and then 
mapPartitions on top to iterate through the key-value pairs. Unfortunately 
sortByKey does not let you control the Partitioner, but it's fairly easy to 
write your own version that does if this is important.

In previous versions, the values for each key had to fit in memory (though we 
could have data on disk across keys), and this is still true for groupByKey, 
cogroup and join. Those restrictions will hopefully go away in a later release. 
But sortByKey + mapPartitions lets you just iterate through the key-value pairs 
without worrying about this.

Matei

On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:

When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same machine 
(thread)
2) All keys received by a specific machine (thread) will be received in sorted 
order
3) These conditions will hold even if the values associated with a specific key
are too large to fit in memory.

In my Hadoop code I use all of these conditions - specifically with my larger 
data sets the size of data I wish to group exceeds the available memory.

I think I understand the operation of groupby but my understanding is that this 
requires that the results for a single key, and perhaps all keys fit on a 
single machine.

Is there a way to perform this like Hadoop and not require that an entire group fit in
memory?




Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
mapPartitions just gives you an Iterator of the values in each partition, and 
lets you return an Iterator of outputs. For instance, take a look at 
https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java#L694.
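
For reference, a rough Java sketch of the sortByKey + mapPartitions pattern
described above (the data and the per-key aggregation are invented, and a real
job would emit results incrementally rather than buffer them in a list):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

import scala.Tuple2;

public class SortThenIterate {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "sort-then-iterate");

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("b", 2),
        new Tuple2<String, Integer>("a", 1),
        new Tuple2<String, Integer>("a", 3)));

    // After sortByKey, each partition holds a contiguous, sorted run of keys,
    // so all values for one key arrive together and can be folded as they stream past.
    JavaRDD<String> perKeySums = pairs.sortByKey().mapPartitions(
        new FlatMapFunction<Iterator<Tuple2<String, Integer>>, String>() {
          @Override
          public Iterable<String> call(Iterator<Tuple2<String, Integer>> it) {
            List<String> out = new ArrayList<String>();
            String currentKey = null;
            long sum = 0;
            while (it.hasNext()) {
              Tuple2<String, Integer> kv = it.next();
              if (currentKey != null && !currentKey.equals(kv._1())) {
                out.add(currentKey + "=" + sum);   // key changed: emit the finished group
                sum = 0;
              }
              currentKey = kv._1();
              sum += kv._2();
            }
            if (currentKey != null) {
              out.add(currentKey + "=" + sum);     // flush the last group
            }
            return out;
          }
        });

    System.out.println(perKeySums.collect());
    sc.stop();
  }
}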

Matei

On August 31, 2014 at 12:26:51 PM, Steve Lewis (lordjoe2...@gmail.com) wrote:

Is there a sample of how to do this -
I see 1.1 is out but cannot find samples of mapPartitions
A Java sample would be very useful 


On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
In 1.1, you'll be able to get all of these properties using sortByKey, and then 
mapPartitions on top to iterate through the key-value pairs. Unfortunately 
sortByKey does not let you control the Partitioner, but it's fairly easy to 
write your own version that does if this is important.

In previous versions, the values for each key had to fit in memory (though we 
could have data on disk across keys), and this is still true for groupByKey, 
cogroup and join. Those restrictions will hopefully go away in a later release. 
But sortByKey + mapPartitions lets you just iterate through the key-value pairs 
without worrying about this.

Matei

On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:

When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same machine 
(thread)
2) All keys received by a specific machine (thread) will be received in sorted 
order
3) These conditions will hold even if the values associated with a specific key
are too large to fit in memory.

In my Hadoop code I use all of these conditions - specifically with my larger 
data sets the size of data I wish to group exceeds the available memory.

I think I understand the operation of groupby but my understanding is that this 
requires that the results for a single key, and perhaps all keys fit on a 
single machine.

Is there a way to perform this like Hadoop and not require that an entire group fit in
memory?




--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com



numpy digitize

2014-08-31 Thread filipus
hi Folks

is there a function in Spark like numpy digitize to discretize a
numerical variable?

or even better

is there a way to use the functionality of the decision tree builder in
Spark MLlib, which splits data into bins in such a way that the split
variable mostly predicts the target value (label)?

this could be useful for logistic regression because the linearization makes
models kind of stable in a way

some people would refer to it as weight-of-evidence modeling



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/numpy-digitize-tp13212.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What does appMasterRpcPort: -1 indicate ?

2014-08-31 Thread Tao Xiao
Thanks Yi, I think your answers make sense.

We can see a series of messages with appMasterRpcPort: -1 followed by a
message with appMasterRpcPort: 0. Perhaps that means we were waiting for
the application master to be started (appMasterRpcPort: -1), and later
the application master got started (appMasterRpcPort: 0).


2014-08-31 23:10 GMT+08:00 Yi Tian tianyi.asiai...@gmail.com:

 I think -1 means your application master has not been started yet.


 On Aug 31, 2014, at 23:02, Tao Xiao xiaotao.cs@gmail.com wrote:

 I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it.

 Following How-to: Run a Simple Apache Spark App in CDH 5 , I tried to
 submit my job in local mode, Spark Standalone mode and YARN mode. I
 successfully submitted my job in local mode and Standalone mode, however, I
 noticed the following messages printed on console when I submitted my job
 in YARN mode:


 14/08/29 22:27:29 INFO Client: Submitting application to ASM

 14/08/29 22:27:29 INFO YarnClientImpl: Submitted application
 application_1406949333981_0015

 14/08/29 22:27:29 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:30 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:31 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:32 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:33 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:34 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:35 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:36 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:37 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:38 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: -1

   appStartTime: 1409365649836

   yarnAppState: ACCEPTED


 14/08/29 22:27:39 INFO YarnClientSchedulerBackend: Application report from
 ASM:

   appMasterRpcPort: 0

   appStartTime: 1409365649836

   yarnAppState: RUNNING


 The job finished successfully and produced correct results.
 But I'm not sure what those messages mean? Does appMasterRpcPort: -1
 indicate an error or exception ?





Spark+OpenCV: Real Time Image Processing

2014-08-31 Thread Varuzhan
Hi everybody!
Now I'm doing something like this:
1) A user uploads an image to the server
2) The server processes that image using a database and Java + OpenCV
3) The server returns some generated result to the user
That is slow now, and if there are many users it will get slower and
maybe stop working altogether.
Now I want to do all of this in real time with Spark.
I have a cluster ready (1 master, 2 slaves) with Spark, and a simple Scala
MapReduce test has passed.
Can you give me an idea of what I need to make my Java code (which processes
the image and produces the result) work in Scala and run all of this in real
time?
Thank you!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-OpenCV-Real-Time-Image-Processing-tp13214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: how to filter value in spark

2014-08-31 Thread Liu, Raymond
You could use cogroup to combine RDDs in one RDD for cross reference processing.

e.g.

a.cogroup(b).filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, r)) => (k, l) }

Best Regards,
Raymond Liu

-Original Message-
From: marylucy [mailto:qaz163wsx_...@hotmail.com] 
Sent: Friday, August 29, 2014 9:26 PM
To: Matthew Farrellee
Cc: user@spark.apache.org
Subject: Re: how to filter value in spark

i see it works well, thank you!!!

But in the following situation, how do I do it?

var a = sc.textFile("/sparktest/1/").map((_, "a"))
var b = sc.textFile("/sparktest/2/").map((_, "b"))
How do I get (3, "a") and (4, "a")?


 On Aug 28, 2014, at 19:54, Matthew Farrellee m...@redhat.com wrote:

 On 08/28/2014 07:20 AM, marylucy wrote:
 fileA = 1 2 3 4, one number per line, saved in /sparktest/1/
 fileB = 3 4 5 6, one number per line, saved in /sparktest/2/
 I want to get 3 and 4
 
 var a = sc.textFile("/sparktest/1/").map((_, 1))
 var b = sc.textFile("/sparktest/2/").map((_, 1))
 
 a.filter(param => b.lookup(param._1).length > 0).map(_._1).foreach(println)
 
 Error thrown:
 Scala.MatchError:Null
 PairRDDFunctions.lookup...
 
 the issue is nesting of the b rdd inside a transformation of the a rdd
 
 consider using intersection, it's more idiomatic
 
 a.intersection(b).foreach(println)
 
 but note that intersection will remove duplicates
 
 best,
 
 
 matt
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
 additional commands, e-mail: user-h...@spark.apache.org
 


HELP! EXPORT DATA FROM HIVE TO SQL SERVER

2014-08-31 Thread churly lin
hi, all:

I am working with Hive from Spark now. I use Spark SQL (HiveFromSpark) for
calculating data and saving the results in a Hive table.
Now I need to export the results in the Hive table to SQL Server. Is
there a way to do this?
Thank you all.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HELP! EXPORT DATA FROM HIVE TO SQL SERVER

2014-08-31 Thread Gordon Wang
Try Sqoop?
What do you mean by exporting the results to SQL Server?

On Mon, Sep 1, 2014 at 10:41 AM, churly lin chury...@gmail.com wrote:

 I am working on hive from spark now. I use sparkSQL(HiveFormSpark) for
 calculating data and save the results in hive table.
 And now, I need export the results in hive table to sql server. Is
 there a way to do this?




-- 
Regards
Gordon Wang