Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Ji ZHANG
Hi,

Recently I've been migrating from Shark 0.9 to Spark SQL 1.2; my CDH version
is 4.5, with Hive 0.11. I've managed to set up the Spark SQL Thrift server, and
normal queries work fine, but custom UDFs are not usable.

The symptom is when executing CREATE TEMPORARY FUNCTION, the query
hangs on a lock request:

14/12/22 14:41:57 DEBUG ClientCnxn: Reading reply
sessionid:0x34a6121e6d93e74, packet:: clientPath:null serverPath:null
finished:false header:: 289,8  replyHeader:: 289,51540866762,0
request:: '/hive_zookeeper_namespace_hive1/default,F  response::
v{'sample_07,'LOCK-EXCLUSIVE-0001565612,'LOCK-EXCLUSIVE-0001565957}
14/12/22 14:41:57 ERROR ZooKeeperHiveLockManager: conflicting lock
present for default mode EXCLUSIVE

Is it a compatibility issue because Spark SQL 1.2 is based on Hive
0.13? Is there a workaround instead of upgrading CDH or forbidding UDF
on Spark SQL?

Thanks.

-- 
Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



what is the default log4j configuration passed to yarn container

2014-12-22 Thread Venkata ramana gollamudi
Hi,

In the case of an MR task, the log4j configuration and container log folder for a 
container are explicitly set in the container launch context by 
org.apache.hadoop.mapreduce.v2.util.MRApps.addLog4jSystemProperties, i.e. from the 
MapReduce YARN client code and not from YARN itself. 
This is also visible from jinfo on the MR container's pid: 
-Dlog4j.configuration=container-log4j.properties 
-Dyarn.app.container.log.dir=<container log 
dir>/application_1413959638984_0032/container_1413959638984_0032_01_01 
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx1024m


But in the case of Spark we don't see 
log4j.configuration=container-log4j.properties being set by default; the container 
JVM options look like: 
-XX:OnOutOfMemoryError=kill %p -Xms512m -Xmx512m -Djava.io.tmpdir=<local 
dir>/usercache/root/appcache/application_1413959638984_0027/container_1413959638984_0027_01_03/tmp
 -Dlog4j.configuration.watch=false -Dspark.akka.timeout=100 
-Dspark.akka.frameSize=10 -Dspark.akka.heartbeat.pauses=600 
-Dspark.akka.threads=4 -Dspark.yarn.app.container.log.dir=<container log 
dir>/application_1413959638984_0027/container_1413959638984_0027_01_03


If the user doesn't set a custom log4j configuration, how are the default Spark 
container log4j settings applied? Is a log4j configuration file placed on the class 
path/jars, or is the log4j appender configured programmatically? Since the container log 
folder is different for each container, a static log4j configuration cannot be used, so 
spark.yarn.app.container.log.dir might be getting set in the log4j.properties, 
but it is not clear who sets it or where. Can anyone give more information on this?

Regards,
Ramana
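
For what it's worth, a user can make the container log4j configuration explicit from
the application side. A minimal sketch (paths are hypothetical): ship a custom
log4j.properties to each YARN container via spark.yarn.dist.files and point the
executor JVMs at it via spark.executor.extraJavaOptions.

import org.apache.spark.{SparkConf, SparkContext}

object ExplicitYarnLogging {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("explicit-yarn-logging")
      // Ship a custom log4j.properties into every container's working directory...
      .set("spark.yarn.dist.files", "file:///path/to/log4j.properties")
      // ...and tell the executor JVMs to load it from there.
      .set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 10).count()  // any job; its executor logs use the shipped config
    sc.stop()
  }
}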
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Cheng Lian

Hi Ji,

Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive 
API/protocol compatibility issues. When interacting with Hive 0.11.x, 
connections and simple queries may succeed, but things may go crazy in 
unexpected corners (like UDF).


Cheng
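
For reference, the statement in question is a Hive UDF registration issued through the
HiveContext (the Thrift server uses the same code path). A minimal sketch, with
hypothetical jar path and class name, of what goes through the bundled Hive client:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object RegisterHiveUdf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-udf-example"))
    val hive = new HiveContext(sc)
    // These statements are executed by the Hive client bundled with Spark SQL
    // (0.13.1 by default in 1.2), which is where the 0.11 mismatch shows up.
    hive.sql("ADD JAR /path/to/my-udfs.jar")
    hive.sql("CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF'")
    hive.sql("SELECT my_lower(description) FROM sample_07 LIMIT 10").collect().foreach(println)
    sc.stop()
  }
}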

On 12/22/14 4:15 PM, Ji ZHANG wrote:

Hi,

Recently I've been migrating from Shark 0.9 to Spark SQL 1.2; my CDH version
is 4.5, with Hive 0.11. I've managed to set up the Spark SQL Thrift server, and
normal queries work fine, but custom UDFs are not usable.

The symptom is when executing CREATE TEMPORARY FUNCTION, the query
hangs on a lock request:

14/12/22 14:41:57 DEBUG ClientCnxn: Reading reply
sessionid:0x34a6121e6d93e74, packet:: clientPath:null serverPath:null
finished:false header:: 289,8  replyHeader:: 289,51540866762,0
request:: '/hive_zookeeper_namespace_hive1/default,F  response::
v{'sample_07,'LOCK-EXCLUSIVE-0001565612,'LOCK-EXCLUSIVE-0001565957}
14/12/22 14:41:57 ERROR ZooKeeperHiveLockManager: conflicting lock
present for default mode EXCLUSIVE

Is it a compatibility issue because Spark SQL 1.2 is based on Hive
0.13? Is there a workaround instead of upgrading CDH or forbidding UDF
on Spark SQL?

Thanks.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Graceful shutdown in spark streaming

2014-12-22 Thread Jesper Lundgren
Hello all,

I have a spark streaming application running in a standalone cluster
(deployed with spark-submit --deploy-mode cluster). I am trying to add
graceful shutdown functionality to this application but I am not sure what
is the best practice for this.

Currently I am using this code:

sys.addShutdownHook {
  log.info("Shutdown requested. Graceful shutdown started.")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  log.info("Shutdown Complete. Bye")
}
ssc.start()
ssc.awaitTermination()

It seems to be working, but I still get some errors in the driver log, and
the master UI shows "FAILED" as the status for the driver after it is stopped.

Driver log:

14/12/22 16:42:30 INFO Main: Shutdown requested. Graceful shutdown started.
.
.
.
14/12/22 16:42:40 INFO JobGenerator: Stopping JobGenerator gracefully.
.
.
.
14/12/22 16:42:50 INFO DAGScheduler: Job 1 failed: start at Main.scala:114, took 93.444222 s
Exception in thread "Thread-34" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
at
org.apache.spark.scheduler.DAGScheduler$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475) at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
at
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$finishTerminate(FaultHandling.scala:210)
at
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369) at
akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462) at
akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at
akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at
akka.dispatch.Mailbox.run(Mailbox.scala:219) at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/12/22 16:42:51 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/12/22 16:42:51 INFO MemoryStore: MemoryStore cleared
14/12/22 16:42:51 INFO BlockManager: BlockManager stopped
14/12/22 16:42:51 INFO BlockManagerMaster: BlockManagerMaster stopped
14/12/22 16:42:51 INFO SparkContext: Successfully stopped SparkContext
14/12/22 16:42:51 INFO Main: Shutdown Complete. Bye
(end of driver log)

Does anyone have experience to share regarding graceful shutdown in production
for Spark Streaming?

Thanks!

Best Regards,
Jesper Lundgren


Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-12-22 Thread pradhandeep
Did you try running PageRank.scala instead of LiveJournalPageRank.scala?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p20808.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to get list of edges between two Vertex ?

2014-12-22 Thread pradhandeep
Do you need the multiple edges, or can you get the work done with a single
edge between two vertices?
From my point of view, you can group the edges using groupEdges, which will group
the same edges together. It may work because the messages passed between the
vertices through the same (replicated) edges will not be different.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-list-of-edges-between-two-Vertex-tp19309p20809.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Possible problems in packaging mlllib

2014-12-22 Thread shkesar
I am trying to run the  twitter classifier
https://github.com/databricks/reference-apps  

A NoClassDefFoundError pops up. I've checked the library, and the HashingTF
class file is there. Some Stack Overflow questions suggest it might be a
problem with packaging the class.

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/mllib/feature/HashingTF
at com.databricks.apps.twitter_classifier.Utils$.init(Utils.scala:12)
at com.databricks.apps.twitter_classifier.Utils$.clinit(Utils.scala)
at 
com.databricks.apps.twitter_classifier.Collect$.main(Collect.scala:26)
at com.databricks.apps.twitter_classifier.Collect.main(Collect.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.mllib.feature.HashingTF
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-problems-in-packaging-mlllib-tp20810.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Possible problems in packaging mlllib

2014-12-22 Thread Sean Owen
Are you using an old version of Spark? I think this appeared in 1.1.
You don't usually package this class or MLlib, so your packaging
probably is not relevant, but it has to be available at runtime on
your cluster then.

On Mon, Dec 22, 2014 at 10:16 AM, shkesar shubhamke...@live.com wrote:
 I am trying to run the  twitter classifier
 https://github.com/databricks/reference-apps

 A NoClasssDefFoundError pops up. I've checked the library that the HashingTF
 class file is there. Some stack overflow questions show that might be
 problem with packaging the class.

 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/mllib/feature/HashingTF
 at 
 com.databricks.apps.twitter_classifier.Utils$.init(Utils.scala:12)
 at com.databricks.apps.twitter_classifier.Utils$.clinit(Utils.scala)
 at 
 com.databricks.apps.twitter_classifier.Collect$.main(Collect.scala:26)
 at com.databricks.apps.twitter_classifier.Collect.main(Collect.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at 
 com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.mllib.feature.HashingTF
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 9 more



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Possible-problems-in-packaging-mlllib-tp20810.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi,
Say we have 4 nodes with 2 cores each in standalone mode. I'd like to dedicate 
4 cores to a streaming application. I can do this via spark-submit by:

spark-submit  --total-executor-cores 4

However, this assigns one core per machine. I would like to use 2 cores on 2 
machines instead, leaving the other two machines untouched. Is this possible? 
Is there a downside to doing this? My thinking is that I should be able to 
reduce quite a bit of network traffic if all machines are not involved.


Thanks,
Ashic.
  

Re: Using more cores on machines

2014-12-22 Thread Sean Owen
I think you want:

--num-executors 2 --executor-cores 2

On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote:
 Hi,
 Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to
 dedicate 4 cores to a streaming application. I can do this via spark submit
 by:

 spark-submit  --total-executor-cores 4

 However, this assigns one core per machine. I would like to use 2 cores on 2
 machines instead, leaving the other two machines untouched. Is this
 possible? Is there a downside to doing this? My thinking is that I should be
 able to reduce quite a bit of network traffic if all machines are not
 involved.


 Thanks,
 Ashic.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi Sean,
Thanks for the response. 

It seems --num-executors is ignored. Specifying --num-executors 2 
--executor-cores 2 is giving the app all 8 cores across 4 machines.

-Ashic.

 From: so...@cloudera.com
 Date: Mon, 22 Dec 2014 10:57:31 +
 Subject: Re: Using more cores on machines
 To: as...@live.com
 CC: user@spark.apache.org
 
 I think you want:
 
 --num-executors 2 --executor-cores 2
 
 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote:
  Hi,
  Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to
  dedicate 4 cores to a streaming application. I can do this via spark submit
  by:
 
  spark-submit  --total-executor-cores 4
 
  However, this assigns one core per machine. I would like to use 2 cores on 2
  machines instead, leaving the other two machines untouched. Is this
  possible? Is there a downside to doing this? My thinking is that I should be
  able to reduce quite a bit of network traffic if all machines are not
  involved.
 
 
  Thanks,
  Ashic.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

Re: java.sql.SQLException: No suitable driver found

2014-12-22 Thread Michael Orr
Here is a script I use to submit a directory of jar files. It assumes jar files 
are in target/dependency or lib/

DRIVER_PATH=
DEPEND_PATH=
if [ -d lib ]; then
  DRIVER_PATH=lib
  DEPEND_PATH=lib
else
  DRIVER_PATH=target
  DEPEND_PATH=target/dependency
fi

DEPEND_JARS=log4j.properties
for f in `ls $DEPEND_PATH`; do DEPEND_JARS=$DEPEND_JARS,$DEPEND_PATH/$f; done

$SPARK_HOME/bin/spark-submit \
  --class $1 \
  --master yarn-client \
  --num-executors 1 \
  --driver-memory 4g \
  --executor-memory 16g \
  --executor-cores 4 \
  --jars $DEPEND_JARS \
  $DRIVER_PATH/core-ingest-*.jar ${*:2}


You would run it with a command like:

./run.sh class.to.submit arg1 arg2 …


 On Dec 22, 2014, at 1:11 AM, durga durgak...@gmail.com wrote:
 
 One more question.
 
 How would I submit additional jars to the spark-submit job. I used --jars
 option, it seems it is not working as explained earlier.
 
 Thanks for the help,
 
 -D
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-sql-SQLException-No-suitable-driver-found-tp20792p20805.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: locality sensitive hashing for spark

2014-12-22 Thread Michael Orr
The implementation closely aligns with Jaccard. It should be possible to swap 
out the hash functions for a family that is compatible with other distance 
measures.
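
As a concrete illustration of that idea (this is a generic sketch, not the spark-hash
API): a signed-random-projection family hashes vectors that are similar by cosine to
the same bit signature, so swapping it in for minhash moves the scheme from Jaccard
to cosine similarity.

import scala.util.Random

// A hash family for cosine similarity: one random hyperplane per output bit.
class CosineHashFamily(dim: Int, numBits: Int, seed: Long = 42L) extends Serializable {
  private val rng = new Random(seed)
  private val planes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

  /** Vectors with a small angle between them agree on most bits. */
  def signature(v: Array[Double]): Array[Int] =
    planes.map { p =>
      val dot = p.zip(v).map { case (a, b) => a * b }.sum
      if (dot >= 0) 1 else 0
    }
}

object CosineHashExample extends App {
  val family = new CosineHashFamily(dim = 4, numBits = 16)
  val a = Array(1.0, 0.5, 0.0, 2.0)
  val b = Array(0.9, 0.6, 0.1, 1.8) // nearly parallel to a
  println(family.signature(a).mkString)
  println(family.signature(b).mkString)
}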



 On Dec 22, 2014, at 1:16 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
 
 Looks interesting thanks for sharing.
 
 Does it support cosine similarity ? I only saw jaccard mentioned from a quick 
 glance.
 
 —
 Sent from Mailbox https://www.dropbox.com/mailbox
 
 On Mon, Dec 22, 2014 at 4:12 AM, morr0723 michael.d@gmail.com wrote:
 
 I've pushed out an implementation of locality sensitive hashing for spark. 
 LSH has a number of use cases, most prominent being if the features are not 
 based in Euclidean space. 
 
 Code, documentation, and small exemplar dataset is available on github: 
 
 https://github.com/mrsqueeze/spark-hash 
 
 Feel free to pass along any comments or issues. 
 
 Enjoy! 
 
 
 
 
 
 -- 
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/locality-sensitive-hashing-for-spark-tp20803.html
  
 Sent from the Apache Spark User List mailing list archive at Nabble.com. 
 
 - 
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 For additional commands, e-mail: user-h...@spark.apache.org 
 
 
 



Re: S3 files , Spark job hungsup

2014-12-22 Thread Shuai Zheng
Is it possible that too many connections are open to read from S3 from one node? I
had this issue before because I opened a few hundred files on S3 to read from one
node. It just blocked itself without error until a timeout later.

On Monday, December 22, 2014, durga durgak...@gmail.com wrote:

 Hi All,

 I am facing a strange issue sporadically: occasionally my Spark job
 hangs on reading S3 files. It is not throwing an exception or making
 progress; it just hangs there.

 Is this a known issue , Please let me know how could I solve this issue.

 Thanks,
 -D



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3-files-Spark-job-hungsup-tp20806.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Can Spark SQL thrift server UI provide JOB kill operate or any REST API?

2014-12-22 Thread Xiaoyu Wang
Hello everyone!

Like the title.
I started the Spark SQL 1.2.0 Thrift server and use beeline to connect to the server 
to execute SQL.
I want to kill one SQL job running in the Thrift server without killing the Thrift 
server itself.
I set the property spark.ui.killEnabled=true in spark-defaults.conf.
But in the UI only stages can be killed; the job can’t be killed!
Is there any way to kill a SQL job in the Thrift server?


Xiaoyu Wang
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Fetch Failure

2014-12-22 Thread steghe
Which version of spark are you running?

It could be related to this
https://issues.apache.org/jira/browse/SPARK-3633

fixed in 1.1.1 and 1.2.0





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Effects problems in logistic regression

2014-12-22 Thread Franco Barrientos
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!

 

From: Franco Barrientos [mailto:franco.barrien...@exalitica.com] 
Sent: Thursday, December 18, 2014 16:42
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression

 

Thanks I will try.

 

From: DB Tsai [mailto:dbt...@dbtsai.com] 
Sent: Thursday, December 18, 2014 16:24
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Can you try LogisticRegressionWithLBFGS? I verified that it converges to the same 
result as a model trained by R's glmnet package without regularization. The problem 
with LogisticRegressionWithSGD is that it converges very slowly, and much of the 
time it is very sensitive to the step size, which can lead to a wrong 
answer. 

 

The regularization logic in MLLib is not entirely correct, and it will penalize 
the intercept. In general, with really high regularization, all the 
coefficients will be zeros except the intercept. In logistic regression, the 
non-zero intercept can be understood as the prior-probability of each class, 
and in linear regression, this will be the mean of response. I'll have a PR to 
fix this issue.





Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
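
A minimal sketch of combining this advice with feature standardization (which also
addresses the exp() overflow on the raw amount column further down the thread); the
file path is hypothetical and the data is assumed to be in LIBSVM format:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object ScaledLBFGSExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scaled-lbfgs"))
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training.libsvm")

    // Standardize to unit variance so a column ranging 0..59102 does not dominate.
    val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

    val model = new LogisticRegressionWithLBFGS()
      .setIntercept(true)
      .run(scaled)
    println(s"intercept = ${model.intercept}, weights = ${model.weights}")
    sc.stop()
  }
}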

 

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

Yes, without the “amounts” variables the results are similar. When I put in other 
variables it's fine.

 

From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, December 18, 2014 14:22
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Are you sure this is an apples-to-apples comparison? for example does your SAS 
process normalize or otherwise transform the data first? 

 

Is the optimization configured similarly in both cases -- same regularization, 
etc.?

 

Are you sure you are pulling out the intercept correctly? It is a separate 
value from the logistic regression model in Spark.

 

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

Hi all!,

 

I have a problem with LogisticRegressionWithSGD: when I train a data set with 
one variable (which is the amount of an item) plus an intercept, I get weights of

(-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to 
me, because when I run a logistic regression for the same data in SAS I get the 
weights (-2.6604, 0.000245).

 

The range of this variable is from 0 to 59102, with a mean of 1158.

 

The problem is that when I want to calculate the probabilities for each user in the 
data set, the probability is near zero or exactly zero in many cases, because when 
Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) it is a huge number, in 
fact infinity for Spark.

 

How can I treat this variable? Or why does this happen? 

 

Thanks ,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649 
(+569)-76347893 

franco.barrien...@exalitica.com 

www.exalitica.com 

 



Spark exception when sending message to akka actor

2014-12-22 Thread Priya Ch
Hi All,

I have akka remote actors running on 2 nodes. I submitted spark application
from node1. In the spark code, in one of the rdd, i am sending message to
actor running on node1. My Spark code is as follows:




class ActorClient extends Actor with Serializable
{
  import context._

  val currentActor: ActorSelection =
    context.system.actorSelection("akka.tcp://ActorSystem@192.168.145.183:2551/user/MasterActor")
  implicit val timeout = Timeout(10 seconds)


  def receive =
  {
      case msg: String => { if (msg.contains("Spark"))
                            { currentActor ! msg
                              sender ! "Local"
                            }
                            else
                            {
                              println("Received.." + msg)
                              val future = currentActor ? msg
                              val result = Await.result(future,
                                timeout.duration).asInstanceOf[String]
                              if (result.contains("ACK"))
                                sender ! "OK"
                            }
                          }
      case PoisonPill => context.stop(self)
  }
}

object SparkExec extends Serializable
{

  implicit val timeout = Timeout(10 seconds)
  val actorSystem = ActorSystem("ClientActorSystem")
  val actor =
    actorSystem.actorOf(Props(classOf[ActorClient]), name = "ClientActor")

  def main(args: Array[String]) =
  {

    val conf = new SparkConf().setAppName("DeepLearningSpark")

    val sc = new SparkContext(conf)

    val textrdd =
      sc.textFile("hdfs://IMPETUS-DSRV02:9000/deeplearning/sample24k.csv")
    val rdd1 = textrdd.map { line => println("In Map...")

      val future = actor ? "Hello..Spark"
      val result =
        Await.result(future, timeout.duration).asInstanceOf[String]
      if (result.contains("Local")) {
        println("Received in map " + result)
        //actorSystem.shutdown
      }
      (10)
    }


    val rdd2 = rdd1.map { x =>
      val future = actor ? "Done"
      val result = Await.result(future,
        timeout.duration).asInstanceOf[String]
      if (result.contains("OK"))
      {
        actorSystem.stop(remoteActor)
        actorSystem.shutdown
      }
      (2) }
    rdd2.saveAsTextFile("/home/padma/SparkAkkaOut")
  }

}

In my ActorClient actor, I identify the remote actor through actorSelection
and send it the message. Once the messages are sent, in *rdd2*, after
receiving an ack from the remote actor, I kill the ActorClient actor and
shut down the ActorSystem.

The above code is throwing the following exception:




14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0
(TID 1, IMPETUS-DSRV05.impetus.co.in):
java.lang.ExceptionInInitializerError:
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)
14/12/22 19:04:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, IMPETUS-DSRV05.impetus.co.in): java.lang.NoClassDefFoundError:
Could not initialize class com.impetus.spark.SparkExec$
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:166)
com.impetus.spark.SparkExec$$anonfun$2.apply(SparkExec.scala:159)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)


Re: Effects problems in logistic regression

2014-12-22 Thread DB Tsai
Sounds great.


Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos 
franco.barrien...@exalitica.com wrote:

 Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!



 *From:* Franco Barrientos [mailto:franco.barrien...@exalitica.com]
 *Sent:* Thursday, December 18, 2014 16:42
 *To:* 'DB Tsai'
 *CC:* 'Sean Owen'; user@spark.apache.org
 *Subject:* RE: Effects problems in logistic regression



 Thanks I will try.



 *From:* DB Tsai [mailto:dbt...@dbtsai.com]
 *Sent:* Thursday, December 18, 2014 16:24
 *To:* Franco Barrientos
 *CC:* Sean Owen; user@spark.apache.org
 *Subject:* Re: Effects problems in logistic regression



 Can you try LogisticRegressionWithLBFGS? I verified that this will be
 converged to the same result trained by R's glmnet package without
 regularization. The problem of LogisticRegressionWithSGD is it's very
 slow in term of converging, and lots of time, it's very sensitive to
 stepsize which can lead to wrong answer.



 The regularization logic in MLLib is not entirely correct, and it will
 penalize the intercept. In general, with really high regularization, all
 the coefficients will be zeros except the intercept. In logistic
 regression, the non-zero intercept can be understood as the
 prior-probability of each class, and in linear regression, this will be the
 mean of response. I'll have a PR to fix this issue.



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai



 On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos 
 franco.barrien...@exalitica.com wrote:

 Yes, without the “amounts” variables the results are similiar. When I put
 other variables its fine.



 *From:* Sean Owen [mailto:so...@cloudera.com]
 *Sent:* Thursday, December 18, 2014 14:22
 *To:* Franco Barrientos
 *CC:* user@spark.apache.org
 *Subject:* Re: Effects problems in logistic regression



 Are you sure this is an apples-to-apples comparison? for example does your
 SAS process normalize or otherwise transform the data first?



 Is the optimization configured similarly in both cases -- same
 regularization, etc.?



 Are you sure you are pulling out the intercept correctly? It is a separate
 value from the logistic regression model in Spark.



 On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos 
 franco.barrien...@exalitica.com wrote:

 Hi all!,



 I have a problem with LogisticRegressionWithSGD, when I train a data set
 with one variable (wich is a amount of an item) and intercept, I get
 weights of

 (-0.4021,-207.1749) for both features, respectively. This don´t make sense
 to me because I run a logistic regression for the same data in SAS and I
 get these weights (-2.6604,0.000245).



 The rank of this variable is from 0 to 59102 with a mean of 1158.



 The problem is when I want to calculate the probabilities for each user
 from data set, this probability is near to zero or zero in much cases,
 because when spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) this is
 a big number, in fact infinity for spark.



 How can I treat this variable? or why this happened?



 Thanks ,



 *Franco Barrientos*
 Data Scientist

 Málaga #115, Of. 1003, Las Condes.
 Santiago, Chile.
 (+562)-29699649
 (+569)-76347893

 franco.barrien...@exalitica.com

 www.exalitica.com







Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi,

After facing issues with the performance of some of our Spark Streaming
 jobs, we invested quite some effort figuring out the factors that affect
the performance characteristics of a Streaming job. We  defined an
empirical model that helps us reason about Streaming jobs and applied it to
tune the jobs in order to maximize throughput.

We have summarized our findings in a blog post with the intention of
collecting feedback and hoping that it is useful to other Spark Streaming
users facing similar issues.

 http://www.virdata.com/tuning-spark/

Your feedback is welcome.

With kind regards,

Gerard.
Data Processing Team Lead
Virdata.com
@maasg


RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi Josh,
I'm not looking to change the 1:1 ratio.

What I'm trying to do is get both cores on two machines working, rather than 
one core on all four machines. With --total-executor-cores 4, I have 1 core per 
machine working for an app. I'm looking for something that'll let me use 2 
cores per machine on 2 machines (so 4 cores in total) while not using the other 
two machines.

Regards,
Ashic.

 From: j...@soundcloud.com
 Date: Mon, 22 Dec 2014 17:36:26 +0100
 Subject: Re: Using more cores on machines
 To: as...@live.com
 CC: so...@cloudera.com; user@spark.apache.org
 
 AFAIK, `--num-executors` is not available for standalone clusters. In
 standalone mode, you must start new workers on your node as it is a
 1:1 ratio of workers to executors.
 
 
 On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote:
  Hi Sean,
  Thanks for the response.
 
  It seems --num-executors is ignored. Specifying --num-executors 2
  --executor-cores 2 is giving the app all 8 cores across 4 machines.
 
  -Ashic.
 
  From: so...@cloudera.com
  Date: Mon, 22 Dec 2014 10:57:31 +
  Subject: Re: Using more cores on machines
  To: as...@live.com
  CC: user@spark.apache.org
 
 
  I think you want:
 
  --num-executors 2 --executor-cores 2
 
  On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote:
   Hi,
   Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to
   dedicate 4 cores to a streaming application. I can do this via spark
   submit
   by:
  
   spark-submit  --total-executor-cores 4
  
   However, this assigns one core per machine. I would like to use 2 cores
   on 2
   machines instead, leaving the other two machines untouched. Is this
   possible? Is there a downside to doing this? My thinking is that I
   should be
   able to reduce quite a bit of network traffic if all machines are not
   involved.
  
  
   Thanks,
   Ashic.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

Re: MLlib, classification label problem

2014-12-22 Thread Sean Owen
Yeah, it's mentioned in the doc:

Note that, in the mathematical formulation in this guide, a training label
y is denoted as either +1 (positive) or −1 (negative), which is
convenient for the formulation. However, the negative label is
represented by 0 in MLlib instead of −1, to be consistent with
multiclass labeling.

Both are valid and equally correct, although the two conventions lead
to different expressions for the gradients and loss functions. I also
find it is a little confusing since the docs explain one form, and the
code implements another form (except some examples, which actually
reimplement with the -1/+1 convention).

I personally am also more used to the forms corresponding to 0 for the
negative class, but I'm sure some will say they're more accustomed to
the other convention.
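
For anyone holding data in the ±1 convention, mapping it onto MLlib's 0/1 convention
before training is a one-line transformation; a small sketch:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LabelConventions {
  // Convert -1/+1 labels to the 0/1 labels MLlib's classifiers expect.
  def toZeroOne(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
    data.map(p => LabeledPoint(if (p.label > 0) 1.0 else 0.0, p.features))
}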

On Mon, Dec 22, 2014 at 4:02 PM, Hao Ren inv...@gmail.com wrote:
 Hi,

 When going through the MLlib doc for classification:
 http://spark.apache.org/docs/latest/mllib-linear-methods.html, I find that
 the loss functions are based on label {1, -1}.

 But in MLlib, the loss functions on label {1, 0} are used. And there is a
 dataValidation check before fitting, if a label is other than 0 or 1, an
 exception will be thrown.

 I don't understand the intention here. Could someone explain this ?

 Hao.




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-classification-label-problem-tp20813.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using more cores on machines

2014-12-22 Thread Boromir Widas
If you are looking to reduce network traffic then setting
spark.deploy.spreadOut
to false may help.
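(For completeness, and as a pointer rather than a tested recipe: spark.deploy.spreadOut
is read by the standalone master rather than by the application, so it is typically set
on the master, e.g. SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false" in
conf/spark-env.sh followed by a master restart, and it then applies to every application
on that cluster.)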

On Mon, Dec 22, 2014 at 11:44 AM, Ashic Mahtab as...@live.com wrote:

 Hi Josh,
 I'm not looking to change the 1:1 ratio.

 What I'm trying to do is get both cores on two machines working, rather
 than one core on all four machines. With --total-executor-cores 4, I have 1
 core per machine working for an app. I'm looking for something that'll let
 me use 2 cores per machine on 2 machines (so 4 cores in total) while not
 using the other two machines.

 Regards,
 Ashic.

  From: j...@soundcloud.com
  Date: Mon, 22 Dec 2014 17:36:26 +0100
  Subject: Re: Using more cores on machines
  To: as...@live.com
  CC: so...@cloudera.com; user@spark.apache.org

 
  AFAIK, `--num-executors` is not available for standalone clusters. In
  standalone mode, you must start new workers on your node as it is a
  1:1 ratio of workers to executors.
 
 
  On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote:
   Hi Sean,
   Thanks for the response.
  
   It seems --num-executors is ignored. Specifying --num-executors 2
   --executor-cores 2 is giving the app all 8 cores across 4 machines.
  
   -Ashic.
  
   From: so...@cloudera.com
   Date: Mon, 22 Dec 2014 10:57:31 +
   Subject: Re: Using more cores on machines
   To: as...@live.com
   CC: user@spark.apache.org
  
  
   I think you want:
  
   --num-executors 2 --executor-cores 2
  
   On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com
 wrote:
Hi,
Say we have 4 nodes with 2 cores each in stand alone mode. I'd like
 to
dedicate 4 cores to a streaming application. I can do this via spark
submit
by:
   
spark-submit  --total-executor-cores 4
   
However, this assigns one core per machine. I would like to use 2
 cores
on 2
machines instead, leaving the other two machines untouched. Is this
possible? Is there a downside to doing this? My thinking is that I
should be
able to reduce quite a bit of network traffic if all machines are
 not
involved.
   
   
Thanks,
Ashic.
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: S3 files , Spark job hungsup

2014-12-22 Thread durga katakam
Yes. I am reading thousands of files every hour. Is there any way I can
tell Spark to time out?
Thanks for your help.

-D

On Mon, Dec 22, 2014 at 4:57 AM, Shuai Zheng szheng.c...@gmail.com wrote:

 Is it possible too many connections open to read from s3 from one node? I
 have this issue before because I open a few hundreds of files on s3 to read
 from one node. It just block itself without error until timeout later.

 On Monday, December 22, 2014, durga durgak...@gmail.com wrote:

 Hi All,

 I am facing a strange issue sporadically. occasionally my spark job is
 hungup on reading s3 files. It is not throwing exception . or making some
 progress, it is just hungs up there.

 Is this a known issue , Please let me know how could I solve this issue.

 Thanks,
 -D



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/S3-files-Spark-job-hungsup-tp20806.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: custom python converter from HBase Result to tuple

2014-12-22 Thread Ted Yu
Which HBase version are you using ?

Can you show the full stack trace ?

Cheers

On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid
wrote:

 Hi,

  can anyone please give me some help with how to write a custom converter of HBase
  data to (for example) tuples of ((family, qualifier, value), ) for pyspark?

  I was trying something like this (here trying to get tuples of
  (family:qualifier:value, )):


  class HBaseResultToTupleConverter extends Converter[Any, List[String]] {
    override def convert(obj: Any): List[String] = {
      val result = obj.asInstanceOf[Result]
      result.rawCells().map(cell =>
        List(Bytes.toString(CellUtil.cloneFamily(cell)),
          Bytes.toString(CellUtil.cloneQualifier(cell)),
          Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")
      ).toList
    }
  }


 but then I get a error:

 14/12/22 16:27:40 WARN python.SerDeUtil:
 Failed to pickle Java object as value: $colon$colon, falling back
 to 'toString'. Error: couldn't introspect javabean:
 java.lang.IllegalArgumentException: wrong number of arguments


 does anyone have a hint?

 Thanks,
 Antony.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi Tim,

That would be awesome. We have seen some really disparate Mesos allocations
for our Spark Streaming jobs (like (7,4,1) over 3 executors for 4 Kafka
consumers, instead of the ideal (3,3,3,3)).
For network-dependent consumers, achieving an even deployment would
provide reliable and reproducible streaming job execution from a
performance point of view.
We're deploying in coarse-grained mode. Not sure Spark Streaming would work
well in fine-grained mode given the added latency to acquire a worker.

You mention that you're changing the Mesos scheduler. Is there a Jira where
this job is taking place?

-kr, Gerard.


On Mon, Dec 22, 2014 at 6:01 PM, Timothy Chen tnac...@gmail.com wrote:

 Hi Gerard,

 Really nice guide!

 I'm particularly interested in the Mesos scheduling side to more evenly
 distribute cores across cluster.

 I wonder if you are using coarse grain mode or fine grain mode?

 I'm making changes to the spark mesos scheduler and I think we can propose
 a best way to achieve what you mentioned.

 Tim

 Sent from my iPhone

  On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote:
 
  Hi,
 
  After facing issues with the performance of some of our Spark Streaming
  jobs, we invested quite some effort figuring out the factors that affect
  the performance characteristics of a Streaming job. We  defined an
  empirical model that helps us reason about Streaming jobs and applied it
 to
  tune the jobs in order to maximize throughput.
 
  We have summarized our findings in a blog post with the intention of
  collecting feedback and hoping that it is useful to other Spark Streaming
  users facing similar issues.
 
  http://www.virdata.com/tuning-spark/
 
  Your feedback is welcome.
 
  With kind regards,
 
  Gerard.
  Data Processing Team Lead
  Virdata.com
  @maasg



Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers,

I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install packages
for any version of Spark, and makes it easy for developers to
contribute packages.

Spark Packages will feature integrations with various data sources,
management tools, higher level domain-specific libraries, machine
learning algorithms, code samples, and other Spark content. Thanks to
the package authors, the initial listing of packages includes
scientific computing libraries, a job execution server, a connector
for importing Avro data, tools for launching Spark on Google Compute
Engine, and many others.

I’d like to invite you to contribute and use Spark Packages and
provide feedback! As a disclaimer: Spark Packages is a community index
maintained by Databricks and (by design) will include packages outside
of the ASF Spark project. We are excited to help showcase and support
all of the great work going on in the broader Spark community!

Cheers,
Xiangrui

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: does spark sql support columnar compression with encoding when caching tables

2014-12-22 Thread Sadhan Sood
Thanks Cheng, Michael - that was super helpful.

On Sun, Dec 21, 2014 at 7:27 AM, Cheng Lian lian.cs@gmail.com wrote:

  I'd like to add that the compression schemes built into the in-memory columnar
 storage only support primitive columns (int, string, etc.); complex types
 like array, map and struct are not supported.
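
A small sketch of the caching pattern discussed in this thread (table name and date
filter are hypothetical; the CACHE TABLE ... AS SELECT form is the one quoted below):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object CachePartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-partition"))
    val sqlContext = new HiveContext(sc)
    // With this on, cached columns get RLE/delta/dictionary encoding where applicable;
    // as noted above, there is no additional generic compression on top of that.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.sql("CACHE TABLE table_cached AS SELECT * FROM my_table WHERE date = 20141222")
    sqlContext.sql("SELECT COUNT(*) FROM table_cached").collect().foreach(println)
    sc.stop()
  }
}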


 On 12/20/14 6:17 AM, Sadhan Sood wrote:

  Hey Michael,

  Thank you for clarifying that. Is Tachyon the right way to get compressed
 data in memory, or should we explore the option of adding compression to
 cached data? This is because our uncompressed data set is too big to fit in
 memory right now. I see the benefit of Tachyon not just in storing
 compressed data in memory, but also in that we wouldn't have to create a separate
 table for caching some partitions, like 'cache table table_cached as select * from
 table where date = 201412XX' - the way we are doing it right now.


 On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust mich...@databricks.com
 wrote:

 There is only column level encoding (run length encoding, delta encoding,
 dictionary encoding) and no generic compression.

 On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

  Wondering, when caching a table backed by LZO-compressed Parquet
 data, whether Spark also compresses it (using lzo/gzip/snappy) along with column-level
 encoding, or just does the column-level encoding when
 *spark.sql.inMemoryColumnarStorage.compressed* is set to true. This is because
 when I try to cache the data, I notice the memory being used is almost as much
 as the uncompressed size of the data.

  Thanks!





Re: UNION two RDDs

2014-12-22 Thread Jerry Lam
Hi Sean and Madhu,

Thank you for the explanation. I really appreciate it.

Best Regards,

Jerry


On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote:

 coalesce actually changes the number of partitions. Unless the
 original RDD had just 1 partition, coalesce(1) will make an RDD with 1
 partition that is larger than the original partitions, of course.

 I don't think the question is about ordering of things within an
 element of the RDD?

 If the original RDD was sorted, and so has a defined ordering, then it
 will be preserved. Otherwise I believe you do not have any guarantees
 about ordering. In practice, you may find that you still encounter the
 elements in the same order after coalesce(1), although I am not sure
 that is even true.

 union() is the same story; unless the RDDs are sorted I don't think
 there are guarantees. However I'm almost certain that in practice, as
 it happens now, A's elements would come before B's after a union, if
 you did traverse them.
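
 A tiny local sketch of the behavior being described; the order shown in the comment
 is what one typically observes, not a guarantee:

 import org.apache.spark.{SparkConf, SparkContext}

 object UnionOrderExample {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("union-order").setMaster("local[2]"))
     val a = sc.parallelize(Seq(1, 2, 3), 2)
     val b = sc.parallelize(Seq(10, 20, 30), 2)
     // In practice this prints 1,2,3,10,20,30 (A's partitions before B's),
     // but that ordering is an implementation detail, not a contract.
     println(a.union(b).coalesce(1).collect().mkString(","))
     sc.stop()
   }
 }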

 On Fri, Dec 19, 2014 at 5:41 AM, madhu phatak phatak@gmail.com
 wrote:
  Hi,
  coalesce is an operation which changes the number of records in a partition. It
  will not touch the ordering within a row, AFAIK.
 
  On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam chiling...@gmail.com wrote:
 
  Hi Spark users,
 
  I wonder if val resultRDD = RDDA.union(RDDB) will always have records in
  RDDA before records in RDDB.
 
  Also, will resultRDD.coalesce(1) change this ordering?
 
  Best Regards,
 
  Jerry
 
 
 
  --
  Regards,
  Madhukara Phatak
  http://www.madhukaraphatak.com



MLLib beginner question

2014-12-22 Thread boci
Hi!

I want to try out Spark MLlib in my Spark project, but I have a little
problem. I have training data (an external file), but the real data comes from
another RDD. How can I do that?
I tried simply using the same SparkContext for both (first I create an RDD
using sc.textFile() and then NaiveBayes.train...). After that I want to
fetch the real data using the same context and call predict inside a map.
But my application never exits (I think it is stuck or something). Why
doesn't this solution work?
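
A minimal sketch of the pattern described (training on a labeled text file, then
predicting on vectors from a second RDD built with the same SparkContext); the paths
and the parsing format are hypothetical:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object TrainThenPredict {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nb-train-predict"))

    // Training file: "label,f1 f2 f3" per line (hypothetical format).
    val training = sc.textFile("hdfs:///path/to/training.txt").map { line =>
      val Array(label, feats) = line.split(",", 2)
      LabeledPoint(label.toDouble, Vectors.dense(feats.trim.split(' ').map(_.toDouble)))
    }
    val model = NaiveBayes.train(training, 1.0)

    // The "real" data comes from another RDD created on the same context.
    val real = sc.textFile("hdfs:///path/to/real.txt")
      .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
    real.map(v => model.predict(v)).take(10).foreach(println)

    sc.stop() // stop the context so the driver JVM can exit cleanly
  }
}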

Thanks

b0c1


--
Skype: boci13, Hangout: boci.b...@gmail.com


Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-22 Thread Sean Owen
Just closing the loop -- FWIW this was indeed on purpose --
https://issues.apache.org/jira/browse/SPARK-3452 . I take it that it's
not encouraged to depend on the REPL as a module.

On Sun, Dec 21, 2014 at 10:34 AM, Sean Owen so...@cloudera.com wrote:
 I'm only speculating, but I wonder if it was on purpose? would people
 ever build an app against the REPL?

 On Sun, Dec 21, 2014 at 5:50 AM, Peng Cheng pc...@uow.edu.au wrote:
 Everything else is there except spark-repl. Can someone check that out this
 weekend?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Long-running job cleanup

2014-12-22 Thread Ganelin, Ilya
Hi all, I have a long running job iterating over a huge dataset. Parts of this 
operation are cached. Since the job runs for so long, eventually the overhead 
of spark shuffles starts to accumulate culminating in the driver starting to 
swap.

I am aware of the spark.cleaner.ttl parameter that allows me to configure when 
cleanup happens, but the issue with doing this is that it isn’t done safely, 
e.g. I can be in the middle of processing a stage when this cleanup happens and 
my cached RDDs get cleared. This ultimately causes a KeyNotFoundException when 
I try to reference the now-cleared cached RDD. This behavior doesn’t make much 
sense to me; I would expect the cached RDD to either get regenerated or at the 
very least for there to be an option to execute this cleanup without deleting 
those RDDs.

Is there a programmatically safe way of doing this cleanup that doesn’t break 
everything?

If I instead tear down the spark context and bring up a new context for every 
iteration (assuming that each iteration is sufficiently long-lived), would 
memory get released appropriately?
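
One programmatic pattern (a sketch, not a claim that it covers every source of shuffle
overhead) is to persist only what each iteration needs and unpersist it explicitly at a
point where nothing references it any more, instead of relying on the TTL-based cleaner:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeCleanup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-cleanup"))
    val input = sc.parallelize(1 to 1000000, 8)
    for (i <- 1 to 10) {
      // Cache only what this iteration needs...
      val working = input.map(_ * i).cache()
      println(s"iteration $i count = ${working.count()}") // run the iteration's actions
      // ...then release it once the iteration's jobs are done.
      working.unpersist(blocking = true)
    }
    sc.stop()
  }
}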


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Spark in Standalone mode

2014-12-22 Thread durga
Please check the Spark version and Hadoop version in your Maven build as well as in
your local Spark setup. If the Hadoop versions don't match you might get this issue.

Thanks,
-D



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-in-Standalone-mode-tp20780p20815.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-22 Thread peng

Thanks a lot for pointing it out. I also found it in pom.xml.
A new ticket for reverting it has been submitted: 
https://issues.apache.org/jira/browse/SPARK-4923
At first I assumed that further development on it had been moved to 
Databricks Cloud, but the JIRA ticket was already there in September, so 
maybe demand for this API from the community is indeed low enough.
However, I would still suggest keeping it, or even promoting it to a 
Developer API; this would encourage more projects to integrate in a 
more flexible way, and save prototyping/QA cost by customizing fixtures 
of the REPL. People will still move to Databricks Cloud, which has far more 
features than that. Many influential projects already depend on the 
routinely published Scala REPL (e.g. the Play Framework), so it would be strange 
for Spark not to do the same.

What do you think?

Yours Peng

On 12/22/2014 04:57 PM, Sean Owen wrote:

Just closing the loop -- FWIW this was indeed on purpose --
https://issues.apache.org/jira/browse/SPARK-3452 . I take it that it's
not encouraged to depend on the REPL as a module.

On Sun, Dec 21, 2014 at 10:34 AM, Sean Owen so...@cloudera.com wrote:

I'm only speculating, but I wonder if it was on purpose? would people
ever build an app against the REPL?

On Sun, Dec 21, 2014 at 5:50 AM, Peng Cheng pc...@uow.edu.au wrote:

Everything else is there except spark-repl. Can someone check that out this
weekend?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread peng

Me 2 :)

On 12/22/2014 06:14 PM, Andrew Ash wrote:

Hi Xiangrui,

That link is currently returning a 503 Over Quota error message.  
Would you mind pinging back out when the page is back up?


Thanks!
Andrew

On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:


Dear Spark users and developers,

I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install packages
for any version of Spark, and makes it easy for developers to
contribute packages.

Spark Packages will feature integrations with various data sources,
management tools, higher level domain-specific libraries, machine
learning algorithms, code samples, and other Spark content. Thanks to
the package authors, the initial listing of packages includes
scientific computing libraries, a job execution server, a connector
for importing Avro data, tools for launching Spark on Google Compute
Engine, and many others.

I’d like to invite you to contribute and use Spark Packages and
provide feedback! As a disclaimer: Spark Packages is a community index
maintained by Databricks and (by design) will include packages outside
of the ASF Spark project. We are excited to help showcase and support
all of the great work going on in the broader Spark community!

Cheers,
Xiangrui

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: Announcing Spark Packages

2014-12-22 Thread Hitesh Shah
Hello Xiangrui, 

If you have not already done so, you should look at 
http://www.apache.org/foundation/marks/#domains for the policy on use of ASF 
trademarked terms in domain names. 

thanks
— Hitesh

On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

 Dear Spark users and developers,
 
 I’m happy to announce Spark Packages (http://spark-packages.org), a
 community package index to track the growing number of open source
 packages and libraries that work with Apache Spark. Spark Packages
 makes it easy for users to find, discuss, rate, and install packages
 for any version of Spark, and makes it easy for developers to
 contribute packages.
 
 Spark Packages will feature integrations with various data sources,
 management tools, higher level domain-specific libraries, machine
 learning algorithms, code samples, and other Spark content. Thanks to
 the package authors, the initial listing of packages includes
 scientific computing libraries, a job execution server, a connector
 for importing Avro data, tools for launching Spark on Google Compute
 Engine, and many others.
 
 I’d like to invite you to contribute and use Spark Packages and
 provide feedback! As a disclaimer: Spark Packages is a community index
 maintained by Databricks and (by design) will include packages outside
 of the ASF Spark project. We are excited to help showcase and support
 all of the great work going on in the broader Spark community!
 
 Cheers,
 Xiangrui
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can Spark SQL thrift server UI provide JOB kill operate or any REST API?

2014-12-22 Thread Michael Armbrust
I would expect that killing a stage would kill the whole job.  Are you not
seeing that happen?

On Mon, Dec 22, 2014 at 5:09 AM, Xiaoyu Wang wangxy...@gmail.com wrote:

 Hello everyone!

 Like the title.
 I started the Spark SQL 1.2.0 thrift server and use beeline to connect to the
 server and execute SQL.
 I want to kill one SQL job running in the thrift server without killing the
 thrift server itself.
 I set the property spark.ui.killEnabled=true in spark-defaults.conf.
 But in the UI, only stages can be killed, and the job can't be killed!
 Is there any way to kill the SQL job in the thrift server?


 Xiaoyu Wang
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Xiangrui Meng
Did you check the indices in the LIBSVM data and the master file? Do
they match? -Xiangrui

On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote:
 Hi All,
 I use LIBSVM format to specify my input feature vector, which used 1-based
 index. When I run regression the o/p is 0-indexed based. I have a master
 lookup file that maps back these indices to what they stand for. However, I
 need to add offset of 2 and not 1 to the regression outcome during the
 mapping. So for example to map the index of 800 from the regression output
 file, I look for 802 in my master lookup file and then things make sense. I
 can understand adding offset of 1, but not sure why adding offset 2 is
 working fine. Have others seen something like this as well?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLLib beginner question

2014-12-22 Thread Xiangrui Meng
How big is the dataset you want to use in prediction? -Xiangrui

On Mon, Dec 22, 2014 at 1:47 PM, boci boci.b...@gmail.com wrote:
 Hi!

 I want to try out Spark MLlib in my Spark project, but I have a little
 problem. I have training data (an external file), but the real data comes from
 another RDD. How can I do that?
 I tried simply using the same SparkContext to build the RDDs (first I create an
 RDD using sc.textFile() and then call NaiveBayes.train...). After that I want to
 fetch the real data using the same context and call predict inside a map. But
 my application never exits (I think it is stuck or something). Why doesn't this
 solution work?

 Thanks

 b0c1


 --
 Skype: boci13, Hangout: boci.b...@gmail.com
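
A minimal sketch of the pattern described above, for reference (hedged: the
path, feature layout, and use of the LIBSVM loader below are assumptions, not
boci's actual setup). The point is only that the trained model is a plain local
object, so it can be applied to a different RDD built from the same
SparkContext:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// training data from an external file (hypothetical path, LIBSVM format assumed)
val training = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/training.libsvm")
val model = NaiveBayes.train(training)

// the "real" data coming from another RDD, already turned into feature vectors
// (dimensions must match the training data)
val realData = sc.parallelize(Seq(Vectors.dense(1.0, 0.0), Vectors.dense(0.0, 1.0)))

// predict on the whole RDD (or per row inside a map); collect() forces execution
model.predict(realData).collect().foreach(println)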

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Sameer Tilak
Hi,
It is a text format in which each line represents a labeled sparse feature
vector using the following format: label index1:value1 index2:value2 ...
This was the confusing part in the documentation:
where the indices are one-based and in ascending order. After loading, the
feature indices are converted to zero-based.
Let us say that I have 40 features, so I create an index file like this:
Feature, index number: F1 1, F2 2, F3 3, ..., F40 40
I then create my feature vectors in the LIBSVM format, something like:
1 10:1 20:0 8:1 4:0 24:1
1 1:1 40:0 2:1 8:0 9:1 23:1
0 23:1 18:0 13:1

I run regression and get back model.weights, which are 40 weights. Say I get
0.11 0.3445 0.5 ...
In that case, does the first weight (0.11) correspond to index 1/F1 or to
index 2/F2, since the input is 1-based and the o/p is 0-based? Or is 0-based
indexing only for the internal representation, and what you get back at the
end of regression is essentially 1-based indexed like your input, so 0.11
maps onto F1 and so on?
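
To make the indexing question concrete, a small hedged sketch (the path and the
number of iterations are hypothetical). It just prints each weight next to the
1-based file index it should line up with if the documented 1-based-to-0-based
conversion done at load time is the only shift involved:

import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// indices in the file are 1-based; after loading they become 0-based
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/features.libsvm")
val model = LinearRegressionWithSGD.train(data, 100)

// under that assumption, weights(i) corresponds to file index i + 1 (weights(0) -> F1)
model.weights.toArray.zipWithIndex.foreach { case (w, i) =>
  println(s"file index ${i + 1} (F${i + 1}) -> weight $w")
}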


 Date: Mon, 22 Dec 2014 16:31:57 -0800
 Subject: Re: Interpreting MLLib's linear regression o/p
 From: men...@gmail.com
 To: ssti...@live.com
 CC: user@spark.apache.org
 
 Did you check the indices in the LIBSVM data and the master file? Do
 they match? -Xiangrui
 
 On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote:
  Hi All,
  I use LIBSVM format to specify my input feature vector, which used 1-based
  index. When I run regression the o/p is 0-indexed based. I have a master
  lookup file that maps back these indices to what they stand for. However, I
  need to add offset of 2 and not 1 to the regression outcome during the
  mapping. So for example to map the index of 800 from the regression output
  file, I look for 802 in my master lookup file and then things make sense. I
  can understand adding offset of 1, but not sure why adding offset 2 is
  working fine. Have others seen something like this as well?
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

Re: Announcing Spark Packages

2014-12-22 Thread Nicholas Chammas
Hitesh,

From your link http://www.apache.org/foundation/marks/#domains:

You may not use ASF trademarks such as “Apache” or “ApacheFoo” or “Foo” in
your own domain names if that use would be likely to confuse a relevant
consumer about the source of software or services provided through your
website, without written approval of the VP, Apache Brand Management or
designee.

The title on the packages website is “A community index of packages for
Apache Spark.” Furthermore, the footnote of the website reads “Spark
Packages is a community site hosting modules that are not part of Apache
Spark.”

I think there’s nothing on there that would “confuse a relevant consumer
about the source of software”. It’s pretty clear that the Spark Packages
name is well within the ASF’s guidelines.

Have I misunderstood the ASF’s policy?

Nick
​

On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah hit...@apache.org wrote:

 Hello Xiangrui,

 If you have not already done so, you should look at http://www.apache.org/
 foundation/marks/#domains for the policy on use of ASF trademarked terms
 in domain names.

 thanks
 — Hitesh

 On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:

  Dear Spark users and developers,
 
  I’m happy to announce Spark Packages (http://spark-packages.org), a
  community package index to track the growing number of open source
  packages and libraries that work with Apache Spark. Spark Packages
  makes it easy for users to find, discuss, rate, and install packages
  for any version of Spark, and makes it easy for developers to
  contribute packages.
 
  Spark Packages will feature integrations with various data sources,
  management tools, higher level domain-specific libraries, machine
  learning algorithms, code samples, and other Spark content. Thanks to
  the package authors, the initial listing of packages includes
  scientific computing libraries, a job execution server, a connector
  for importing Avro data, tools for launching Spark on Google Compute
  Engine, and many others.
 
  I’d like to invite you to contribute and use Spark Packages and
  provide feedback! As a disclaimer: Spark Packages is a community index
  maintained by Databricks and (by design) will include packages outside
  of the ASF Spark project. We are excited to help showcase and support
  all of the great work going on in the broader Spark community!
 
  Cheers,
  Xiangrui
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Announcing Spark Packages

2014-12-22 Thread Nicholas Chammas
Okie doke! (I just assumed there was an issue since the policy was brought
up.)

On Mon Dec 22 2014 at 8:33:53 PM Patrick Wendell pwend...@gmail.com wrote:

 Hey Nick,

 I think Hitesh was just trying to be helpful and point out the policy
 - not necessarily saying there was an issue. We've taken a close look
 at this and I think we're in good shape here vis-a-vis this policy.

 - Patrick

 On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Hitesh,
 
  From your link:
 
  You may not use ASF trademarks such as Apache or ApacheFoo or Foo
 in
  your own domain names if that use would be likely to confuse a relevant
  consumer about the source of software or services provided through your
  website, without written approval of the VP, Apache Brand Management or
  designee.
 
  The title on the packages website is A community index of packages for
  Apache Spark. Furthermore, the footnote of the website reads Spark
  Packages is a community site hosting modules that are not part of Apache
  Spark.
 
  I think there's nothing on there that would confuse a relevant consumer
  about the source of software. It's pretty clear that the Spark Packages
  name is well within the ASF's guidelines.
 
  Have I misunderstood the ASF's policy?
 
  Nick
 
 
  On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah hit...@apache.org wrote:
 
  Hello Xiangrui,
 
  If you have not already done so, you should look at
  http://www.apache.org/foundation/marks/#domains for the policy on use
 of ASF
  trademarked terms in domain names.
 
  thanks
  -- Hitesh
 
  On Dec 22, 2014, at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:
 
   Dear Spark users and developers,
  
   I'm happy to announce Spark Packages (http://spark-packages.org), a
   community package index to track the growing number of open source
   packages and libraries that work with Apache Spark. Spark Packages
   makes it easy for users to find, discuss, rate, and install packages
   for any version of Spark, and makes it easy for developers to
   contribute packages.
  
   Spark Packages will feature integrations with various data sources,
   management tools, higher level domain-specific libraries, machine
   learning algorithms, code samples, and other Spark content. Thanks to
   the package authors, the initial listing of packages includes
   scientific computing libraries, a job execution server, a connector
   for importing Avro data, tools for launching Spark on Google Compute
   Engine, and many others.
  
   I'd like to invite you to contribute and use Spark Packages and
   provide feedback! As a disclaimer: Spark Packages is a community index
   maintained by Databricks and (by design) will include packages outside
   of the ASF Spark project. We are excited to help showcase and support
   all of the great work going on in the broader Spark community!
  
   Cheers,
   Xiangrui
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: spark streaming python + kafka

2014-12-22 Thread Davies Liu
There is a WIP pull request [1] working on this; it should be merged
into master soon.

[1] https://github.com/apache/spark/pull/3715

On Fri, Dec 19, 2014 at 2:15 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
 Hi,
   I've just seen that Spark Streaming supports Python as of version 1.2.
 Question: does Spark Streaming (Python version) support Kafka integration?
 Thanks
 Oleg.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Who manage the log4j appender while running spark on yarn?

2014-12-22 Thread WangTaoTheTonic
After some discussions with the Hadoop guys, I understand how the mechanism works.
If we don't add -Dlog4j.configuration to the Java options of the container (AM
or executors), it will use the log4j.properties (if any) found on the container's
classpath (extraClasspath plus yarn.application.classpath).

If we want to customize our log4j configuration, we should add
spark.executor.extraJavaOptions=-Dlog4j.configuration=/path/to/log4j.properties
or
spark.yarn.am.extraJavaOptions=-Dlog4j.configuration=/path/to/log4j.properties
to the spark-defaults.conf file.
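
For example, a sketch of one way to wire this up (hedged: the relative file
name in -Dlog4j.configuration only resolves if the custom file is shipped to
each container's working directory, e.g. with --files; the paths and the
application class below are hypothetical):

# spark-defaults.conf
spark.executor.extraJavaOptions   -Dlog4j.configuration=log4j.properties
spark.yarn.am.extraJavaOptions    -Dlog4j.configuration=log4j.properties

# submit, shipping the custom file to every container
spark-submit --master yarn-cluster \
  --files /path/to/log4j.properties \
  --class com.example.MyApp my-app.jar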



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Who-manage-the-log4j-appender-while-running-spark-on-yarn-tp20778p20818.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Who manage the log4j appender while running spark on yarn?

2014-12-22 Thread Marcelo Vanzin
If you don't specify your own log4j.properties, Spark will load the
default one (from
core/src/main/resources/org/apache/spark/log4j-defaults.properties,
which ends up being packaged with the Spark assembly).

You can easily override the config file if you want to, though; check
the Debugging section of the Running on YARN docs.
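
As a starting point for such an override, a custom file can mirror the bundled
defaults; the sketch below is modeled on Spark's log4j-defaults.properties,
with the root level changed to WARN purely as an example:

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n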

On Fri, Dec 19, 2014 at 12:37 AM, WangTaoTheTonic
barneystin...@aliyun.com wrote:
 Hi guys,

 I recently ran Spark on YARN and found that Spark didn't set any log4j properties
 file in configuration or code. And the log4j logs were written into the stderr
 file under ${yarn.nodemanager.log-dirs}/application_${appid}.

 I want to know which side (Spark or Hadoop) controls the appender. I have found
 a related discussion here:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logging-strategy-on-YARN-td8751.html,
 but I think the Spark code has changed a lot since then.

 Any one could offer some guide? Thanks.





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Who-manage-the-log4j-appender-while-running-spark-on-yarn-tp20778.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



broadcasting object issue

2014-12-22 Thread Henry Hung
Hi All,

I have a problem with broadcasting a serializable class object that is returned
by another, non-serializable class. Here is the sample code:

class A extends java.io.Serializable {
  def halo(): String = "halo"
}

class B {
  def getA() = new A
}

val list = List(1)

val b = new B
val a = b.getA

val p = sc.parallelize(list)

// this will fail
val bcA = sc.broadcast(a)
p.map(x => {
  bcA.value.halo()
})

// this will succeed
val bcA = sc.broadcast(new A)
p.map(x => {
  bcA.value.halo()
})


A is a serializable class, whereas B is not serializable.
If I create a new object A through B's method getA(), the map process fails
with the exception "org.apache.spark.SparkException: Task not serializable, Caused
by: java.io.NotSerializableException: $iwC$$iwC$B".

I don't know why Spark checks whether class B is serializable or not; is there
a way to code around this?

Best regards,
Henry


The privileged confidential information contained in this email is intended for 
use only by the addressees as indicated by the original sender of this email. 
If you are not the addressee indicated in this email or are not responsible for 
delivery of the email to such a person, please kindly reply to the sender 
indicating this fact and delete all copies of it from your computer and network 
server immediately. Your cooperation is highly appreciated. It is advised that 
any unauthorized use of confidential information of Winbond is strictly 
prohibited; and any information in this email irrelevant to the official 
business of Winbond shall be deemed as neither given nor endorsed by Winbond.


Re: custom python converter from HBase Result to tuple

2014-12-22 Thread Antony Mayi
Using HBase 0.98.6.
There is no stack trace, just this short error.
I just noticed it does the fallback to toString mentioned in the message, as
this is what I get back in Python:

hbase_rdd.collect()
[(u'key1', u'List(cf1:12345:14567890, cf2:123:14567896)')]

So the question is: why does it fall back to toString?
Thanks, Antony.
 

 On Monday, 22 December 2014, 20:09, Ted Yu yuzhih...@gmail.com wrote:
   
 

 Which HBase version are you using ?
Can you show the full stack trace ?
Cheers
On Mon, Dec 22, 2014 at 11:02 AM, Antony Mayi antonym...@yahoo.com.invalid 
wrote:

Hi,

Can anyone please give me some help on how to write a custom converter of HBase
data to (for example) tuples of ((family, qualifier, value), ...) for PySpark?

I was trying something like this (here converting to tuples of
(family:qualifier:value, ...)):


class HBaseResultToTupleConverter extends Converter[Any, List[String]] {
  override def convert(obj: Any): List[String] = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map(cell =>
      List(Bytes.toString(CellUtil.cloneFamily(cell)),
        Bytes.toString(CellUtil.cloneQualifier(cell)),
        Bytes.toString(CellUtil.cloneValue(cell))).mkString(":")
    ).toList
  }
}


but then I get an error:

14/12/22 16:27:40 WARN python.SerDeUtil:
Failed to pickle Java object as value: $colon$colon, falling back
to 'toString'. Error: couldn't introspect javabean: 
java.lang.IllegalArgumentException: wrong number of arguments


does anyone have a hint?

Thanks,
Antony.
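
One hedged, untested idea that might explain the fallback: the warning names
$colon$colon, i.e. Scala's List cons cell, so the pickler may simply not know
how to turn a Scala List into something Python-friendly. Returning a Java
collection instead of a Scala List could avoid that; a sketch along the lines
of the converter above (the class name is made up):

import java.util.{ArrayList => JArrayList}

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

// same logic as above, but returning java.util.ArrayList instead of a Scala List
class HBaseResultToStringListConverter extends Converter[Any, JArrayList[String]] {
  override def convert(obj: Any): JArrayList[String] = {
    val result = obj.asInstanceOf[Result]
    val out = new JArrayList[String]()
    result.rawCells().foreach { cell =>
      out.add(List(
        Bytes.toString(CellUtil.cloneFamily(cell)),
        Bytes.toString(CellUtil.cloneQualifier(cell)),
        Bytes.toString(CellUtil.cloneValue(cell))).mkString(":"))
    }
    out
  }
}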

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





 
   

Joins in Spark

2014-12-22 Thread Deep Pradhan
Hi,
I have two RDDs, vertices and edges. Vertices is an RDD and edges is a pair
RDD. I want to take a three-way join of these two. Joins work only when both
the RDDs are pair RDDs, right? So how am I supposed to take a three-way
join of these RDDs?

Thank You


Re: broadcasting object issue

2014-12-22 Thread madhu phatak
Hi,
 I just ran your code in the spark-shell. If you replace

 val bcA = sc.broadcast(a)


with

val bcA = sc.broadcast(new B().getA)


it seems to work. Not sure why.
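
One hedged guess about why: in spark-shell every line is compiled into a
generated wrapper class ($iwC), and the NotSerializableException names
$iwC$$iwC$B, which suggests that serializing the task pulls in the wrapper
holding the non-serializable val b, reachable through the val a. Broadcasting
an A that was built without going through those shell vals keeps B out of that
object graph, which is what the working variant does. A sketch of the same
idea, reusing A, B and p from the question (untested beyond what the reply
above already shows):

// works per the reply above: no shell val referencing b is involved
val bcA = sc.broadcast(new B().getA)
p.map(x => bcA.value.halo()).collect()

// same idea, with the non-serializable B confined to a helper
def makeA(): A = new B().getA()
val bcA2 = sc.broadcast(makeA())
p.map(x => bcA2.value.halo()).collect()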


On Tue, Dec 23, 2014 at 9:12 AM, Henry Hung ythu...@winbond.com wrote:

  Hi All,



 I have a problem with broadcasting a serializable class object that is returned
 by another, non-serializable class. Here is the sample code:



 class A extends java.io.Serializable {

 def halo(): String = "halo"

 }



 class B {

 def getA() = new A

 }



 val list = List(1)



 val b = new B

 val a = b.getA



 val p = sc.parallelize(list)



 // this will fail

 val bcA = sc.broadcast(a)

 p.map(x => {

 bcA.value.halo()

 })



 // this will succeed

 val bcA = sc.broadcast(new A)

 p.map(x => {

 bcA.value.halo()

 })





 A is a serializable class, whereas B is not serializable.

 If I create a new object A through B's method getA(), the map process will
 fail with the exception “org.apache.spark.SparkException: Task not
 serializable, Caused by: java.io.NotSerializableException: $iwC$$iwC$B”



 I don't know why Spark checks whether class B is serializable or not; is
 there a way to code around this?



 Best regards,

 Henry

 --
 The privileged confidential information contained in this email is
 intended for use only by the addressees as indicated by the original sender
 of this email. If you are not the addressee indicated in this email or are
 not responsible for delivery of the email to such a person, please kindly
 reply to the sender indicating this fact and delete all copies of it from
 your computer and network server immediately. Your cooperation is highly
 appreciated. It is advised that any unauthorized use of confidential
 information of Winbond is strictly prohibited; and any information in this
 email irrelevant to the official business of Winbond shall be deemed as
 neither given nor endorsed by Winbond.




-- 
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com


Re: Joins in Spark

2014-12-22 Thread madhu phatak
Hi,
 You can map your vertices RDD as follows:

val pairVertices = verticesRDD.map(vertice => (vertice, null))

The above gives you a pair RDD. After the join, make sure that you remove the
superfluous null values.
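
A rough sketch of that pattern (hedged: the vertex and edge types below are
made up, and "three-way" is read here as joining the edge list against the
vertex list on both endpoints):

// hypothetical data: vertices as an RDD[Long], edges as a pair RDD[(Long, Long)]
val vertices = sc.parallelize(Seq(1L, 2L, 3L))
val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L)))

// pad the vertices with a dummy value to get a pair RDD
val pairVertices = vertices.map(v => (v, null))

// join on the source vertex, re-key by destination, and join again
val triples = edges.join(pairVertices)                 // (src, (dst, null))
  .map { case (src, (dst, _)) => (dst, src) }          // re-key by dst
  .join(pairVertices)                                  // (dst, (src, null))
  .map { case (dst, (src, _)) => (src, dst) }          // drop the padding

triples.collect().foreach(println)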

On Tue, Dec 23, 2014 at 10:36 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have two RDDs, vertices and edges. Vertices is an RDD and edges is a
 pair RDD. I want to take a three-way join of these two. Joins work only when
 both the RDDs are pair RDDs, right? So how am I supposed to take a three-way
 join of these RDDs?

 Thank You




-- 
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com


Joins in Spark

2014-12-22 Thread pradhandeep
Hi,
I have two RDDs, vertices, which is an RDD, and edges, which is a pair RDD. I
have to do a three-way join of these two. Joins work only when both the RDDs
are pair RDDs, so how can we perform a three-way join of these RDDs?

Thank You



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Joins-in-Spark-tp20819.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Fwd: Joins in Spark

2014-12-22 Thread Deep Pradhan
This gives me two pair RDDs: one is the edges RDD and the other is the vertices
RDD with each vertex padded with the value null. But I have to take a three-way
join of these two RDDs, and I have only one common attribute in these two
RDDs. How can I go about doing the three-way join?


Consistent hashing of RDD row

2014-12-22 Thread lev
Hello,

I have a process where I need to create a random number for each row in an
RDD.
That new RDD will be used in a few iterations, and it is necessary that
between iterations the numbers won't change
(i.e., if a partition gets evicted from the cache, the numbers of that
partition will be regenerated the same).
One way to solve it is to persist the RDD (after the random numbers are
created) on disk, but it might be evicted if we run out of space on the
disk, no?

My idea is to do zipWithIndex on my original RDD, and for each row, create a
new random generator with the index as the seed.

I would like to know if zipWithIndex will produce the same index if the RDD gets
evicted from the cache,
for example:

rdd1.join(rdd2).zipWithIndex()
if the join gets recalculated, will the rows get the same index?

or in:
val rdd = hiveContext.sql(...).zipWithIndex()
if the partitions of the query get evicted and recalculated, will the index
stay the same?

I'd love to hear your thoughts on the matter.

Thanks,
Lev.
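
For what it's worth, a small sketch of the seeding idea. It is hedged on the
assumption that the index produced by zipWithIndex is itself stable across
recomputations, which is exactly the open question above; zipWithIndex is only
deterministic if the parent RDD's partitioning and per-partition ordering are.

import scala.util.Random

val rdd = sc.parallelize(Seq("a", "b", "c"))          // stand-in for the real RDD

// derive one random number per row from its index, so a recomputed partition
// regenerates the same values as long as the indices come back the same
val withRandom = rdd.zipWithIndex().map { case (row, idx) =>
  val rng = new Random(idx)                           // index used as the seed
  (row, rng.nextDouble())
}

withRandom.collect().foreach(println)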



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Consistent-hashing-of-RDD-row-tp20820.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL with a sorted file

2014-12-22 Thread Jerry Raj

Michael,
Thanks. Is this still turned off in the released 1.2? Is it possible to 
turn it on just to get an idea of how much of a difference it makes?


-Jerry
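
If you want to experiment, the setting that appears to control this in 1.2 is
spark.sql.parquet.filterPushdown (treat the exact name as an assumption and
double-check it against the 1.2 SQL programming guide); for example:

// hedged: flag name assumed, reportedly off by default in 1.2
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")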

On 05/12/14 12:40 am, Michael Armbrust wrote:

I'll add that some of our data formats will actually infer this sort of
useful information automatically.  Both Parquet and cached in-memory
tables keep statistics on the min/max value for each column.  When you
have predicates over these sorted columns, partitions will be eliminated
if they can't possibly match the predicate given the statistics.

For Parquet this is new in Spark 1.2, and it is turned off by default
(due to bugs we are working with the Parquet library team to fix).
Hopefully soon it will be on by default.

On Wed, Dec 3, 2014 at 8:44 PM, Cheng, Hao hao.ch...@intel.com
mailto:hao.ch...@intel.com wrote:

You can try to write your own Relation with filter push down or use
the ParquetRelation2 for workaround.

(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)

Cheng Hao

-Original Message-
From: Jerry Raj [mailto:jerry@gmail.com
mailto:jerry@gmail.com]
Sent: Thursday, December 4, 2014 11:34 AM
To: user@spark.apache.org mailto:user@spark.apache.org
Subject: Spark SQL with a sorted file

Hi,
If I create a SchemaRDD from a file that I know is sorted on a
certain field, is it possible to somehow pass that information on to
Spark SQL so that SQL queries referencing that field are optimized?

Thanks
-Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org For additional commands,
e-mail: user-h...@spark.apache.org mailto:user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
mailto:user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org