RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Felix Cheung
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set: 
https://github.com/apache/spark/search?utf8=%E2%9C%93=PYTHONHASHSEED

 
Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I get an exception 'Randomness of hash of string
should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not
just set PYTHONHASHSEED?
Should I file a bug?
Kind regards
Andy
details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
Example does not work out of the box
subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")


The following script fixes the problem 
sudo printf "\n# set PYTHONHASHSEED so python3 will not raise the exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

for i in `cat slaves` ; do sudo scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
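
An alternative that avoids hand-editing spark-env.sh on every node is to pass the
variable through Spark's own configuration. A minimal sketch, assuming you submit
with spark-submit and rely on the documented spark.executorEnv.* mechanism to
propagate environment variables to executors (my_job.py is a placeholder):

export PYTHONHASHSEED=123          # for the driver process itself
spark-submit \
  --conf spark.executorEnv.PYTHONHASHSEED=123 \
  my_job.py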


  

Re: Spark Streaming on mesos

2015-11-27 Thread Nagaraj Chandrashekar
Hi Renjie,

I have not set up Spark Streaming on Mesos, but Mesos has a feature called
reservations. It supports both static and dynamic reservations, and both types
of reservations must have a role defined. You may want to explore these
options. Excerpts from the Apache Mesos documentation follow.

Cheers
Nagaraj C
Reservation

Mesos provides mechanisms to reserve resources in specific slaves. The concept 
was first introduced with static reservation in 0.14.0 which enabled operators 
to specify the reserved resources on slave startup. This was extended with 
dynamic reservation in 0.23.0 which enabled operators and authorized frameworks 
to dynamically reserve resources in the cluster.

No breaking changes were introduced with dynamic reservation, which means the 
existing static reservation mechanism continues to be fully supported.

In both types of reservations, resources are reserved for a role.

Static Reservation (since 0.14.0)

An operator can configure a slave with resources reserved for a role. The 
reserved resources are specified via the --resources flag. For example, suppose 
we have 12 CPUs and 6144 MB of RAM available on a slave and that we want to 
reserve 8 CPUs and 4096 MB of RAM for the ads role. We start the slave like so:

$ mesos-slave \
  --master=: \
  --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096"


We now have 8 CPUs and 4096 MB of RAM reserved for ads on this slave.
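
On the Spark side, a minimal coarse-grained submission sketch to go with such a
reservation (assuming Spark 1.5; the master host and application jar are
placeholders, and spark.mesos.role may not exist in older releases, so treat that
line as an assumption):

$ spark-submit \
    --master mesos://<mesos-master-host>:5050 \
    --conf spark.mesos.coarse=true \
    --conf spark.mesos.role=ads \
    --conf spark.cores.max=8 \
    --executor-memory 4g \
    my-streaming-app.jar

Note that in coarse-grained mode spark.cores.max caps the total cores across the
cluster rather than the cores per executor, which is exactly the limitation raised
in the question below.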


From: Renjie Liu
Date: Friday, November 27, 2015 at 9:57 PM
To: user@spark.apache.org
Subject: Spark Streaming on mesos

Hi, all:
I'm trying to run Spark Streaming on Mesos and it seems that neither scheduler
is suitable for it. The fine-grained scheduler starts an executor for each
task, which significantly increases latency, while coarse-grained mode only
lets me set the maximum number of cores and the executor memory; there is no
way to set the number of cores for each executor. Has anyone deployed Spark
Streaming on Mesos? What are your settings?
--
Liu, Renjie
Software Engineer, MVAD


Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Ted Yu
ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.

Andy reported this for ec2 cluster.

I think a JIRA should be opened.


On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung 
wrote:

> May I ask how you are starting Spark?
> It looks like PYTHONHASHSEED is being set:
> https://github.com/apache/spark/search?utf8=%E2%9C%93=PYTHONHASHSEED
>
>
> --
> Date: Thu, 26 Nov 2015 11:30:09 -0800
> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
> From: a...@santacruzintegration.com
> To: user@spark.apache.org
>
> I am using spark-1.5.1-bin-hadoop2.6. I used
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and
> configured spark-env to use python3. I get an exception 'Randomness of
> hash of string should be disabled via PYTHONHASHSEED'. Is there any
> reason rdd.py should not just set PYTHONHASHSEED?
>
> Should I file a bug?
>
> Kind regards
>
> Andy
>
> details
>
>
> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>
> Example does not work out of the box
>
> Subtract(*other*, *numPartitions=None*)
> 
>
> Return each value in self that is not contained in other.
>
> >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
> >>> y = sc.parallelize([("a", 3), ("c", None)])
> >>> sorted(x.subtract(y).collect())
> [('a', 1), ('b', 4), ('b', 5)]
>
> It raises
>
> if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>     raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>
>
>
> *The following script fixes the problem *
>
> sudo printf "\n# set PYTHONHASHSEED so python3 will not raise the exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>
> sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>
> for i in `cat slaves` ; do sudo scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
>
>
>
>


Spark Streaming on mesos

2015-11-27 Thread Renjie Liu
Hi, all:
I'm trying to run Spark Streaming on Mesos and it seems that neither scheduler
is suitable for it. The fine-grained scheduler starts an executor for each
task, which significantly increases latency, while coarse-grained mode only
lets me set the maximum number of cores and the executor memory; there is no
way to set the number of cores for each executor. Has anyone deployed Spark
Streaming on Mesos? What are your settings?
-- 
Liu, Renjie
Software Engineer, MVAD


RE: Hive using Spark engine alone

2015-11-27 Thread Mich Talebzadeh
Thanks Jorn for your interest and appreciate your helpful comments.

 

In this case I am primarily interested in making Hive work with the Spark engine.
It may well be that this is still work in progress and we have to wait for it.

 

Regards,

 

Mich

 

From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: 27 November 2015 14:03
To: Mich Talebzadeh 
Cc: user 
Subject: Re: Hive using Spark engine alone

 

Hi,

 

I recommend using the latest version of Hive. You may also wait for Hive on
Tez with Tez version >= 0.8 and Hive > 1.2. Before that, I recommend first
trying other Hive optimizations: have a look at the storage format together
with storage indexes (not the regular ones), bloom filters, partitions, etc.

Then check the data model. A data model using only varchar is pretty useless.

Afterwards check the execution engine (Tez, Spark, etc.). Most common Hadoop
distributions should have at least Spark preconfigured. Alternatively, if you
do not deal with bulk analytics but rather single inserts, updates and deletes,
you may want to use an external HBase table. Finally, you can check in-memory
caches.
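
For the Hive-on-Spark case specifically, the switch is made on the Hive side, not
in Spark. A minimal sketch, assuming a Hive release that ships Hive on Spark
(1.1+) with a compatible Spark assembly on Hive's classpath (my_table is a
placeholder):

hive> set hive.execution.engine=spark;
hive> set spark.master=yarn-client;
hive> select count(*) from my_table;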

 

Best regards

 

 


On 27 Nov 2015, at 11:43, Mich Talebzadeh wrote:

Hi,

 

As a matter of interest has anyone installed and configured Spark to be used as 
the execution engine for Hive please?

 

This is in contrast to install and configure Spark as an application.

 

Hive by default uses MapReduce as its execution engine, which is more suited to
batch processing. The primary reason I want to use Hive on the Spark engine is
performance.

 

Thanks,

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 



RE: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-11-27 Thread Mich Talebzadeh
Hi,

 

In general YARN is used as the resource scheduler regardless of the execution 
engine whether it is MapReduce or Spark.

 

YARN will create a resource container for the submitted job (that is, the Spark
client) and will execute it in the default engine (in this case Spark). There
will be a job scheduler and one or more Spark executors depending on the
cluster. So as far as I can see, both diagrams are correct.
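
For reference, a yarn-client launch of the kind both diagrams describe looks like
this (class name, jar and resource sizes are placeholders):

$ spark-submit \
    --master yarn-client \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 4g \
    --class com.example.MyApp \
    my-app.jar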

 

HTH

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Nisrina Luthfiyati [mailto:nisrina.luthfiy...@gmail.com] 
Sent: 27 November 2015 11:12
To: user@spark.apache.org
Subject: In yarn-client mode, is it the driver or application master that issue 
commands to executors?

 

Hi all,

I'm trying to understand how yarn-client mode works and found these two 
diagrams:


  

 

   

In the first diagram, it looks like the driver running in the client directly
communicates with executors to issue application commands, while in the second
diagram it looks like application commands are sent to the application master
first and then forwarded to the executors.

Would anyone know which case is true, or is there any other interpretation of
these diagrams?

Thanks!

Nisrina



In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-11-27 Thread Nisrina Luthfiyati
Hi all,
I'm trying to understand how yarn-client mode works and found these two
diagrams:




In the first diagram, it looks like the driver running in the client directly
communicates with executors to issue application commands, while in the
second diagram it looks like application commands are sent to the application
master first and then forwarded to the executors.

Would anyone know which case is true, or is there any other interpretation of
these diagrams?

Thanks!
Nisrina


RE: RE: error while creating HiveContext

2015-11-27 Thread Chandra Mohan, Ananda Vel Murugan
Hi Sun,

I can connect to Hive from the Spark command line and run SQL commands, so I
don't think the problem is with the Hive config file.

Regards,
Anand.C

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Friday, November 27, 2015 3:25 PM
To: Chandra Mohan, Ananda Vel Murugan ; user 

Subject: Re: RE: error while creating HiveContext

Could you provide your hive-site.xml file info ?
Best,
Sun.


fightf...@163.com

From: Chandra Mohan, Ananda Vel Murugan
Date: 2015-11-27 17:04
To: fightf...@163.com; 
user
Subject: RE: error while creating HiveContext
Hi,

I verified and I could see hive-site.xml in spark conf directory.

Regards,
Anand.C

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Friday, November 27, 2015 12:53 PM
To: Chandra Mohan, Ananda Vel Murugan; user
Subject: Re: error while creating HiveContext

Hi,
I think you just want to put the hive-site.xml in the spark/conf directory and 
it would load
it into spark classpath.

Best,
Sun.


fightf...@163.com

From: Chandra Mohan, Ananda Vel Murugan
Date: 2015-11-27 15:04
To: user
Subject: error while creating HiveContext
Hi,

I am building a spark-sql application in Java. I created a Maven project in
Eclipse and added all dependencies, including spark-core and spark-sql. I
create a HiveContext in my Spark program and then try to run SQL queries
against my Hive table. When I submit this job to Spark, for some reason it
tries to create a Derby metastore. But my hive-site.xml clearly specifies the
JDBC URL of my MySQL database, so I think my hive-site.xml is not getting
picked up by the Spark program. I specified the hive-site.xml path using the
"--files" argument of spark-submit. I also tried placing the hive-site.xml
file in my jar. I even tried creating a Configuration object with the
hive-site.xml path and updating my HiveContext by calling the addResource()
method.

I want to know where I should put the Hive config files (in my jar, in my
Eclipse project, or on my cluster) so that they are picked up correctly by my
Spark program.

Thanks for any help.

Regards,
Anand.C
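
For reference, the part of hive-site.xml that has to reach the driver is the JDBC
metastore section; a minimal sketch (host, database name and credentials are
placeholders), placed in $SPARK_HOME/conf or shipped with --files, with the MySQL
JDBC driver jar added via --jars or --driver-class-path:

<configuration>
  <!-- point the metastore at MySQL instead of the default embedded Derby -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>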



Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-11-27 Thread Nisrina Luthfiyati
Hi Mich, thank you for the answer. Regarding the diagrams, I'm specifically
referring to the direct line between the Spark YARN client and the Spark
executor in the first diagram, which implies direct communication with the
executor when issuing application commands, and to the 'Application commands'
and 'Issue application commands' lines in the second diagram, which imply that
the Spark driver in the client communicates with the executor via the YARN
application master (correct me if I'm wrong in these interpretations). Would
you happen to know how the Spark driver communicates with executors in
yarn-client mode, or whether both can be true under different circumstances?

Thanks again,
Nisrina.

On Fri, Nov 27, 2015 at 6:22 PM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> In general YARN is used as the resource scheduler regardless of the
> execution engine whether it is MapReduce or Spark.
>
>
>
> Yarn will create a resource container for the submitted job (that is the
> Spark client) and will execute it in the default engine (in this case
> Spark). There will be a job scheduler and one or more Spark Executors
> depending on the cluster. So as far as I can see both diagrams are correct,
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Nisrina Luthfiyati [mailto:nisrina.luthfiy...@gmail.com]
> *Sent:* 27 November 2015 11:12
> *To:* user@spark.apache.org
> *Subject:* In yarn-client mode, is it the driver or application master
> that issue commands to executors?
>
>
>
> Hi all,
>
> I'm trying to understand how yarn-client mode works and found these two
> diagrams:
>
>
>
>
> In the first diagram, it looks like the driver running in client directly
> communicates with executors to issue application commands, while in the
> second diagram it looks like application commands is sent to application
> master first and then forwarded to executors.
>
> Would anyone knows which case is true or is there any other interpretation
> to these diagrams?
>
> Thanks!
>
> Nisrina
>



-- 
Nisrina Luthfiyati - Ilmu Komputer Fasilkom UI 2010
http://www.facebook.com/nisrina.luthfiyati
http://id.linkedin.com/in/nisrina


Re: Hive using Spark engine alone

2015-11-27 Thread Jörn Franke
Hi,

I recommend using the latest version of Hive. You may also wait for Hive on
Tez with Tez version >= 0.8 and Hive > 1.2. Before that, I recommend first
trying other Hive optimizations: have a look at the storage format together
with storage indexes (not the regular ones), bloom filters, partitions, etc.
Then check the data model. A data model using only varchar is pretty useless.
Afterwards check the execution engine (Tez, Spark, etc.). Most common Hadoop
distributions should have at least Spark preconfigured. Alternatively, if you
do not deal with bulk analytics but rather single inserts, updates and deletes,
you may want to use an external HBase table. Finally, you can check in-memory
caches.

Best regards



> On 27 Nov 2015, at 11:43, Mich Talebzadeh  wrote:
> 
> Hi,
>  
> As a matter of interest has anyone installed and configured Spark to be used 
> as the execution engine for Hive please?
>  
> This is in contrast to install and configure Spark as an application.
>  
> Hive by default uses MapReduce as its execution engine, which is more suited
> to batch processing. The primary reason I want to use Hive on the Spark
> engine is performance.
>  
> Thanks,
>  
> Mich Talebzadeh
>  
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
> one out shortly
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
> any responsibility.
>  


Windows shared folder

2015-11-27 Thread Shuo Wang
Hi,

I am trying to build a small home Spark cluster on Windows. I have a
question about how to share the data files that the master node and
worker nodes need to process. The data files are pretty large, a few hundred GB.

Can I just use a Windows shared folder as the file path for my driver/master
and worker nodes, where my worker nodes are on the same LAN as my
driver/master and the shared folder is on my master node?

-- 
王硕
邮箱:shuo.x.w...@gmail.com
Whatever your journey, keep walking.


Give parallelize a dummy Arraylist length N to control RDD size?

2015-11-27 Thread Jim

 Hello there,

(Part of my problem is that the docs say "undocumented" for parallelize,
which leaves me reading books for examples that don't always pertain.)


I am trying to create an RDD of length N = 10^6 by executing N operations
of a Java class we have; I can have that class implement Serializable or
any Function if necessary. I don't have a fixed-length dataset up front,
I am trying to create one. I am trying to figure out whether to create a
dummy array of length N to parallelize, or to pass parallelize a function
that runs N times.


I am not sure which approach is valid/better. I see that in Spark, if I am
starting out with a well-defined data set like the words in a document, the
length/count of those words is already defined and I just parallelize some
map or filter to do some operation on that data.


In my case I think it's different: I am trying to parallelize the creation of
an RDD that will contain 10^6 elements... here's a lot more info if you
want...


DESCRIPTION:

In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a
PipeLinkageData and returns a DropResult.


I am thinking I could use map() or flatMap() to call a one-to-many function.
I was trying to do something like this in another question that never quite
worked:


JavaRDD<DropResult> simCountRDD =
    spark.parallelize(makeRange(1, getSimCount())).map(
        new Function<Integer, DropResult>() {
            public DropResult call(Integer i) {
                return pld.doDrop();
            }
        });


Thinking something like this is more the correct approach? And this has
more context if desired:


// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into the first param
List<PipeLinkageData> pldListofOne = new ArrayList<PipeLinkageData>();
// make an ArrayList of one
pldListofOne.add(pld);
int howMany = 100;
JavaRDD<DropResult> nSizedRDD =
    spark.parallelize(pldListofOne).flatMap(
        new FlatMapFunction<PipeLinkageData, DropResult>() {
            public Iterable<DropResult> call(PipeLinkageData pld) {
                List<DropResult> returnRDD = new ArrayList<DropResult>();
                // is Spark good at spreading a for loop like this?
                for (int i = 0; i < howMany; i++) {
                    returnRDD.add(pld.doDrop());
                }
                return returnRDD;
            }
        });


One other concern: is a JavaRDD correct here? I can see needing to call
FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never
trying to flatten a group of arrays or lists into a single array or list,
do I really ever need to flatten anything?








Cant start master on windows 7

2015-11-27 Thread Shuo Wang
Hi,

I am trying to use the start-master.sh script on Windows 7. But it fails
to start the master and gives the following error:

ps: unknown option -- o
Try `ps --help' for more information.
starting org.apache.spark.deploy.master.Master, logging to
/c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-My-PC.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch org.apache.spark.deploy.master.Master:
  Spark Command: C:\Program Files (x86)\Java\jre1.8.0_60\bin\java -cp
C:/spark-1.5.2-bin-hadoop2.6/sbin/../conf\;C:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar
-Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip My-PC --port 7077
--webui-port 8080
  
full log in
/c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-My-PC.out


any idea what is going wrong here?



-- 
王硕
邮箱:shuo.x.w...@gmail.com
Whatever your journey, keep walking.
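
The sbin/*.sh scripts rely on Unix tools such as ps, which is why they fail under
a Windows shell. A workaround sketch, assuming the bundled Windows launcher
bin\spark-class.cmd is present, is to start the daemons directly from cmd.exe:

REM start the master from the Spark install directory
bin\spark-class org.apache.spark.deploy.master.Master --host My-PC --port 7077 --webui-port 8080

REM in a second window, start a worker that registers with it
bin\spark-class org.apache.spark.deploy.worker.Worker spark://My-PC:7077

These are the same classes the .sh scripts launch; only the wrapper differs.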


Re: Cant start master on windows 7

2015-11-27 Thread Ted Yu
Have you checked the contents of spark--org.apache.spark.deploy.master.
Master-1-My-PC.out ?

Cheers

On Fri, Nov 27, 2015 at 7:27 AM, Shuo Wang  wrote:

> Hi,
>
> I am trying to use the start-master.sh script on windows 7. But it failed
> to start master, and give the following error,
>
> ps: unknown option -- o
> Try `ps --help' for more information.
> starting org.apache.spark.deploy.master.Master, logging to
> /c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-My-PC.out
> ps: unknown option -- o
> Try `ps --help' for more information.
> failed to launch org.apache.spark.deploy.master.Master:
>   Spark Command: C:\Program Files (x86)\Java\jre1.8.0_60\bin\java -cp
> C:/spark-1.5.2-bin-hadoop2.6/sbin/../conf\;C:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar
> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip My-PC --port 7077
> --webui-port 8080
>   
> full log in
> /c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-My-PC.out
>
>
> any idea what is going wrong here?
>
>
>
> --
> 王硕
> 邮箱:shuo.x.w...@gmail.com
> Whatever your journey, keep walking.
>


Re: Cant start master on windows 7

2015-11-27 Thread Shuo Wang
Hi,

yeah, not much there actually, like the following.

Spark Command: C:\Program Files (x86)\Java\jre1.8.0_60\bin\java -cp
C:/spark-1.5.2-bin-hadoop2.6/sbin/../conf\;C:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar
-Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip My-PC --port 7077
--webui-port 8080


On Fri, Nov 27, 2015 at 11:30 PM, Ted Yu  wrote:

> Have you checked the contents of spark--org.apache.spark.deploy.master.
> Master-1-My-PC.out ?
>
> Cheers
>
> On Fri, Nov 27, 2015 at 7:27 AM, Shuo Wang  wrote:
>
>> Hi,
>>
>> I am trying to use the start-master.sh script on windows 7. But it failed
>> to start master, and give the following error,
>>
>> ps: unknown option -- o
>> Try `ps --help' for more information.
>> starting org.apache.spark.deploy.master.Master, logging to
>> /c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-My-PC.out
>> ps: unknown option -- o
>> Try `ps --help' for more information.
>> failed to launch org.apache.spark.deploy.master.Master:
>>   Spark Command: C:\Program Files (x86)\Java\jre1.8.0_60\bin\java -cp
>> C:/spark-1.5.2-bin-hadoop2.6/sbin/../conf\;C:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.5.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar
>> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip My-PC --port 7077
>> --webui-port 8080
>>   
>> full log in
>> /c/spark-1.5.2-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-My-PC.out
>>
>>
>> any idea what is going wrong here?
>>
>>
>>
>> --
>> 王硕
>> 邮箱:shuo.x.w...@gmail.com
>> Whatever your journey, keep walking.
>>
>
>


-- 
王硕
邮箱:shuo.x.w...@gmail.com
Whatever your journey, keep walking.


Re: RE: error while creating HiveContext

2015-11-27 Thread fightf...@163.com
Could you provide your hive-site.xml file info ? 
Best,
Sun.



fightf...@163.com
 
From: Chandra Mohan, Ananda Vel Murugan
Date: 2015-11-27 17:04
To: fightf...@163.com; user
Subject: RE: error while creating HiveContext
Hi, 
 
I verified and I could see hive-site.xml in spark conf directory. 
 
Regards,
Anand.C
 
From: fightf...@163.com [mailto:fightf...@163.com] 
Sent: Friday, November 27, 2015 12:53 PM
To: Chandra Mohan, Ananda Vel Murugan ; user 

Subject: Re: error while creating HiveContext
 
Hi, 
I think you just want to put the hive-site.xml in the spark/conf directory and 
it would load 
it into spark classpath.
 
Best,
Sun.
 


fightf...@163.com
 
From: Chandra Mohan, Ananda Vel Murugan
Date: 2015-11-27 15:04
To: user
Subject: error while creating HiveContext
Hi, 
 
I am building a spark-sql application in Java. I created a maven project in 
Eclipse and added all dependencies including spark-core and spark-sql. I am 
creating HiveContext in my spark program and then try to run sql queries 
against my Hive Table. When I submit this job in spark, for some reasons it is 
trying to create derby metastore. But my hive-site.xml clearly specifies the 
jdbc url of my MySQL . So I think my hive-site.xml is not getting picked by 
spark program. I specified hive-site.xml path using “—files” argument in 
spark-submit. I also tried placing hive-site.xml file in my jar . I even tried 
creating Configuration object with hive-site.xml path and updated my 
HiveContext by calling addResource() method.   
 
I want to know where I should put hive config files in my jar or in my eclipse 
project or in my cluster for it to be picked by correctly in my spark program. 
 
Thanks for any help. 
 
Regards,
Anand.C
 


RE: error while creating HiveContext

2015-11-27 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

I verified and I could see hive-site.xml in spark conf directory.

Regards,
Anand.C

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Friday, November 27, 2015 12:53 PM
To: Chandra Mohan, Ananda Vel Murugan ; user 

Subject: Re: error while creating HiveContext

Hi,
I think you just want to put the hive-site.xml in the spark/conf directory and 
it would load
it into spark classpath.

Best,
Sun.


fightf...@163.com

From: Chandra Mohan, Ananda Vel Murugan
Date: 2015-11-27 15:04
To: user
Subject: error while creating HiveContext
Hi,

I am building a spark-sql application in Java. I created a maven project in 
Eclipse and added all dependencies including spark-core and spark-sql. I am 
creating HiveContext in my spark program and then try to run sql queries 
against my Hive Table. When I submit this job in spark, for some reasons it is 
trying to create derby metastore. But my hive-site.xml clearly specifies the 
jdbc url of my MySQL . So I think my hive-site.xml is not getting picked by 
spark program. I specified hive-site.xml path using “—files” argument in 
spark-submit. I also tried placing hive-site.xml file in my jar . I even tried 
creating Configuration object with hive-site.xml path and updated my 
HiveContext by calling addResource() method.

I want to know where I should put hive config files in my jar or in my eclipse 
project or in my cluster for it to be picked by correctly in my spark program.

Thanks for any help.

Regards,
Anand.C



Re: thought experiment: use spark ML to real time prediction

2015-11-27 Thread Nick Pentreath
Yup, I agree that Spark (or whatever other ML system) should be focused on
model training rather than real-time scoring. And yes, in most cases
trained models easily fit on a single machine. I also agree that, while
there may be a few use cases out there, Spark Streaming is generally not
well-suited for real-time model scoring. It can be nicely suited for near
real-time model training / updating however.

The thing about models is, once they're built, they are quite standard -
hence standards for I/O and scoring such as PMML. They should ideally also
be completely portable across languages and frameworks - a model trained in
Spark should be usable in a JVM web server, a Python app, a JavaScript AWS
lambda function, etc etc.

The challenge is actually not really "prediction" - which is usually a
simple dot product or matrix operation (or tree walk, or whatever), easily
handled by whatever linear algebra library you are using. It is instead
encapsulating the entire pipeline from raw(-ish) data through
transformations to predictions. As well as versioning, performance
monitoring and online evaluation, A/B testing etc etc.

I guess the point I'm trying to make is that, while it's certainly possible
to create "non-Spark" usable models (e.g. using a spark-ml-common library
or whatever), this only solves a portion of the problem. Now, it may be a
good idea to solve that portion of the problem and leave the rest for
users' own implementation to suit their needs. But I think there is a big
missing piece here that seems like it needs to be filled in the Spark, and
general ML, community.

PMML and related projects such as OpenScoring, or projects like
PredictionIO, seek to solve the problem. PFA seems like a very interesting
potential solution, but it is very young still.

So the question to me is - what is the most efficient way to solve the
problem? I guess for now it may be either something like "spark-ml-common",
or extending PMML support (or both). Perhaps in the future something like
PFA.

It would be interesting to hear more user experiences and what they are
using for serving architectures, how they are handling model
import/export/deployment, etc.
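
For what it's worth, Spark's existing PMML export already covers a few model
types; a minimal sketch from the spark-shell (assuming Spark 1.4+; only some
MLlib models implement PMMLExportable, and the output path is a placeholder):

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

// train a tiny linear model on toy data
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(2.0, Vectors.dense(2.0, 1.0))))
val model = LinearRegressionWithSGD.train(data, 100)

// export to PMML, either as an in-memory string or straight to a path
val pmmlXml: String = model.toPMML()
model.toPMML(sc, "hdfs:///models/lr-model-pmml")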

On Sun, Nov 22, 2015 at 8:33 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Nick
>
> I started this thread. IMHO we need something like Spark to train our
> models. The resulting models are typically small enough to easily fit on a
> single machine. My real-time production system is not built on Spark. The
> real-time system needs to use the model to make predictions in real time.
>
>
> Use case: "high-frequency stock trading". Use Spark to train a model.
> There is no way I could use Spark Streaming in the real-time production
> system. I need some way to easily move the model trained using Spark to a
> non-Spark environment so I can make predictions in real time.
>
> “credit card Fraud detection” is another similar use case.
>
> Kind regards
>
> Andy
>
>
>
>
> From: Nick Pentreath 
> Date: Wednesday, November 18, 2015 at 4:03 AM
> To: DB Tsai 
> Cc: "user @spark" 
>
> Subject: Re: thought experiment: use spark ML to real time prediction
>
> One such "lightweight PMML in JSON" is here -
> https://github.com/bigmlcom/json-pml. At least for the schema
> definitions. But nothing available in terms of evaluation/scoring. Perhaps
> this is something that can form a basis for such a new undertaking.
>
> I agree that distributed models are only really applicable in the case of
> massive scale factor models - and then anyway for latency purposes one
> needs to use LSH or something similar to achieve sufficiently real-time
> performance. These days one can easily spin up a single very powerful
> server to handle even very large models.
>
> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai  wrote:
>
>> I was thinking about working on a better version of PMML, JMML in JSON, but
>> as you said, this requires a dedicated team to define the standard, which
>> would be a huge amount of work.  However, options b) and c) still don't
>> address the distributed-models issue. In fact, most models in production have
>> to be small enough to return the result to users within reasonable latency,
>> so I doubt the usefulness of distributed models in real production
>> use-cases. For R and Python, we can build a wrapper on top of the
>> lightweight "spark-ml-common" project.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath > > wrote:
>>
>>> I think the issue with pulling in all of spark-core is often with
>>> dependencies (and versions) conflicting with the web framework (or Akka in
>>> many cases). Plus it really is quite heavy if you just want a fairly
>>> lightweight model-serving app. For example we've built a 

how to using local repository in spark[dev]

2015-11-27 Thread lihu
Hi all,

I modified the Spark code and am trying to use some extra jars in Spark. The
extra jars are published to my local Maven repository using *mvn install*.
However, sbt cannot find these jar files, even though I can find them
under */home/myname/.m2/repository*.
I can guarantee that the local m2 repository is added to the resolvers,
because I get the following resolvers using the *show resolvers* command.


*List(central: https://repo1.maven.org/maven2,
apache-repo: https://repository.apache.org/content/repositories/releases,
jboss-repo: https://repository.jboss.org/nexus/content/repositories/releases,
mqtt-repo: https://repo.eclipse.org/content/repositories/paho-releases,
cloudera-repo: https://repository.cloudera.com/artifactory/cloudera-repos,
spark-hive-staging: https://oss.sonatype.org/content/repositories/orgspark-project-1113,
mapr-repo: http://repository.mapr.com/maven/,
spring-releases: https://repo.spring.io/libs-release,
twttr-repo: http://maven.twttr.com,
apache.snapshots: http://repository.apache.org/snapshots,
cache:Maven2 Local: /home/myname/.m2/repository)*


Does anyone know how to deal with this? In fact, this worked some days ago,
but after updating my custom jar file and installing it again recently, it
no longer works.

Environment: spark1.5  sbt 0.13.7/0.13.9
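
Two things I would double-check, as a minimal sketch (the artifact coordinates
are placeholders for your own jar): the local Maven resolver has to be declared
in the build, and if the jar was re-published under the same version, the copy
cached in ~/.ivy2 has to be evicted before sbt will pick up the new one.

// build.sbt
resolvers += Resolver.mavenLocal   // resolves against ~/.m2/repository

libraryDependencies += "com.example" % "my-extra-lib" % "0.1.0-SNAPSHOT"

// after re-running mvn install, force re-resolution:
//   rm -rf ~/.ivy2/cache/com.example
//   sbt clean update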


Re: thought experiment: use spark ML to real time prediction

2015-11-27 Thread Nick Pentreath
Thanks for that link Vincenzo. PFA definitely seems interesting - though I
see it is quite wide in scope, almost like its own mini math/programming
language.

Do you know if there are any reference implementations in code? I don't see
any on the web site or the DMG github.

On Sun, Nov 22, 2015 at 2:24 PM, Vincenzo Selvaggio 
wrote:

> The Data Mining Group (http://dmg.org/) that created PMML are working on
> a new standard called PFA that indeed uses JSON documents, see
> http://dmg.org/pfa/docs/motivation/ for details.
>
> PFA could be the answer to your option c.
>
> Regards,
> Vincenzo
>
>
> On Wed, Nov 18, 2015 at 12:03 PM, Nick Pentreath  > wrote:
>
>> One such "lightweight PMML in JSON" is here -
>> https://github.com/bigmlcom/json-pml. At least for the schema
>> definitions. But nothing available in terms of evaluation/scoring. Perhaps
>> this is something that can form a basis for such a new undertaking.
>>
>> I agree that distributed models are only really applicable in the case of
>> massive scale factor models - and then anyway for latency purposes one
>> needs to use LSH or something similar to achieve sufficiently real-time
>> performance. These days one can easily spin up a single very powerful
>> server to handle even very large models.
>>
>> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai  wrote:
>>
>>> I was thinking about to work on better version of PMML, JMML in JSON,
>>> but as you said, this requires a dedicated team to define the standard
>>> which will be a huge work.  However, option b) and c) still don't address
>>> the distributed models issue. In fact, most of the models in production
>>> have to be small enough to return the result to users within reasonable
>>> latency, so I doubt how usefulness of the distributed models in real
>>> production use-case. For R and Python, we can build a wrapper on-top of the
>>> lightweight "spark-ml-common" project.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 I think the issue with pulling in all of spark-core is often with
 dependencies (and versions) conflicting with the web framework (or Akka in
 many cases). Plus it really is quite heavy if you just want a fairly
 lightweight model-serving app. For example we've built a fairly simple but
 scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
 really need is the web framework and Breeze (or an alternative linear
 algebra lib).

 I definitely hear the pain-point that PMML might not be able to handle
 some types of transformations or models that exist in Spark. However,
 here's an example from scikit-learn -> PMML that may be instructive (
 https://github.com/scikit-learn/scikit-learn/issues/1596 and
 https://github.com/jpmml/jpmml-sklearn), where a fairly impressive
 list of estimators and transformers are supported (including e.g. scaling
 and encoding, and PCA).

 I definitely think the current model I/O and "export" or "deploy to
 production" situation needs to be improved substantially. However, you are
 left with the following options:

 (a) build out a lightweight "spark-ml-common" project that brings in
 the dependencies needed for production scoring / transformation in
 independent apps. However, here you only support Scala/Java - what about R
 and Python? Also, what about the distributed models? Perhaps "local"
 wrappers can be created, though this may not work for very large factor or
 LDA models. See also H20 example
 http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html

 (b) build out Spark's PMML support, and add missing stuff to PMML where
 possible. The benefit here is an existing standard with various tools for
 scoring (via REST server, Java app, Pig, Hive, various language support).

 (c) build out a more comprehensive I/O, serialization and scoring
 framework. Here you face the issue of supporting various predictors and
 transformers generically, across platforms and versioning. i.e. you're
 re-creating a new standard like PMML

 Option (a) is do-able, but I'm a bit concerned that it may be too
 "Spark specific", or even too "Scala / Java" specific. But it is still
 potentially very useful to Spark users to build this out and have a
 somewhat standard production serving framework and/or library (there are
 obviously existing options like PredictionIO etc).

 Option (b) is really building out the existing PMML support within
 Spark, so a lot of the initial work has already been done. I know some
 folks had (or have) licensing issues with some components of JPMML (e.g.
 the evaluator and REST 

Re: Millions of entities in custom Hadoop InputFormat and broadcast variable

2015-11-27 Thread Jeff Zhang
Where do you load all the IDs of your dataset? In your custom
InputFormat#getSplits? getSplits is invoked on the driver side to build the
Partitions, which are serialized to the executors as part of each task.

Do you put all the IDs in the InputSplit? That would make it pretty large.

In your case, I think you can load the IDs directly rather than creating a
custom Hadoop InputFormat, e.g.

sc.textFile(id_file, 100).map(load data using the id)

Please make sure to use a high partition number (I use 100 here) in
sc.textFile to get high parallelism.
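
A minimal sketch of that suggestion (loadEntity stands in for however you fetch
a single bean from your data source, and the HDFS path is a placeholder):

// one entity id per line; 100 partitions spreads the load across executors
val ids = sc.textFile("hdfs:///data/entity-ids.txt", 100)

// fetch the full entity for each id inside the executors, not on the driver
val entities = ids.map(line => loadEntity(line.trim.toLong))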

On Fri, Nov 27, 2015 at 2:06 PM, Anfernee Xu  wrote:

> Hi Spark experts,
>
> First of all, happy Thanksgiving!
>
> The comes to my question, I have implemented custom Hadoop InputFormat to
> load millions of entities from my data source to Spark(as JavaRDD and
> transform to DataFrame). The approach I took in implementing the custom
> Hadoop RDD is loading all ID's of my data entity(each entity has an unique
> ID: Long) and split the ID list(contains 3 millions of Long number for
> example) into configured splits, each split contains a sub-set of ID's, in
> turn my custom RecordReader will load the full entity(a plain Java Bean)
> from my data source for each ID in the specific split.
>
> My first observation is that some Spark tasks timed out, and it looks like
> a Spark broadcast variable is being used to distribute my splits; is that
> correct? If so, from a performance perspective, what enhancement can I make
> to improve this?
>
> Thanks
>
> --
> --Anfernee
>



-- 
Best Regards

Jeff Zhang


Re: WARN MemoryStore: Not enough space

2015-11-27 Thread Gylfi
"spark.storage.memoryFraction 0.05"  
If you want to store a lot of memory I think this must be a higher fraction.
The default is 0.6 (not 0.0X). 


To change the output directory you can set "spark.local.dir=/path/to/dir"
and you can even specify multiple directories (for example if you have
multiple mounted devices) by using a ',' between the paths. 

I.e. "spark.local.dir=/path/to/dir1,/path/to/dir2"  or --conf
"spark.local.dir=/mnt/,/mnt2/" (at launch time). 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/WARN-MemoryStore-Not-enough-space-tp25492p25500.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Hive using Spark engine alone

2015-11-27 Thread Mich Talebzadeh
Hi,

 

As a matter of interest has anyone installed and configured Spark to be used
as the execution engine for Hive please?

 

This is in contrast to install and configure Spark as an application.

 

Hive by default uses MapReduce as its execution engine, which is more suited to
batch processing. The primary reason I want to use Hive on the Spark engine is
performance.

 

Thanks,

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.