DataFrames in Spark - Performance when interjected with RDDs

2015-09-07 Thread Pallavi Rao
Hello All,
I had a question regarding the performance optimization (Catalyst Optimizer)
of DataFrames. I understand that DataFrames are interoperable with RDDs. If I
switch back and forth between DataFrames and RDDs, does the performance
optimization still kick in? I need to switch to RDDs to reuse some previously
written functions that were coded against RDDs.

Are there any recommendations/best practices, in terms of performance tuning,
to follow when using a combination of DataFrames and RDDs?

Thank you for your time.

Regards,
Pallavi.
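
For illustration, a minimal Scala sketch of mixing the two APIs (the column and
file names here are made up). Catalyst optimizes only the DataFrame portions of
the job: once .rdd is called, rows are materialized and the RDD operations that
follow are opaque to the optimizer; converting back with toDF starts a fresh,
again-optimized plan.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def mixDataFrameAndRdd(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._

  val df = sqlContext.read.json("people.json")             // planned by Catalyst
  val adults = df.filter($"age" > 21).select("name")       // still optimized (pruning, pushdown)

  // Switching to an RDD to reuse an existing RDD-based function:
  val names: RDD[String] = adults.rdd.map(_.getString(0))  // rows materialized here
  val legacy: RDD[(String, Int)] = names.map(n => (n, n.length)) // invisible to Catalyst

  // Coming back to a DataFrame begins a new optimized plan:
  legacy.toDF("name", "len").groupBy("name").count().show()
}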

-- 
_
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.



Spark SQL - UDF for scoring a model - take $"*"

2015-09-07 Thread Night Wolf
Is it possible to have a UDF which takes a variable number of arguments?

e.g. df.select(myUdf($"*")) fails with

org.apache.spark.sql.AnalysisException: unresolved operator 'Project
[scalaUDF(*) AS scalaUDF(*)#26];

What I would like to do is pass in a generic DataFrame which can then be
passed to a UDF that scores a model. The UDF needs to know the
schema to map column names in the model to columns in the DataFrame.

The model has 100s of factors (very wide), so I can't just have a scoring
UDF that has 500 parameters (for obvious reasons).

Cheers,
~N
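
One way around the wide-UDF problem, sketched below in Scala, is to score at
the RDD level instead: use the DataFrame's schema to map model factor names to
column positions, then rebuild a DataFrame with the extra score column. The
Model type and its score method are made up for illustration, and this is only
a sketch of one possible approach, not what the $"*" route would give you.

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical model: a map from factor (column) name to coefficient.
case class Model(weights: Map[String, Double]) {
  def score(get: String => Double): Double =
    weights.map { case (name, w) => w * get(name) }.sum
}

def scoreAll(sqlContext: SQLContext, df: DataFrame, model: Model): DataFrame = {
  // Resolve factor names to column positions once, using the DataFrame's schema.
  val index = df.schema.fieldNames.zipWithIndex.toMap
  val scored = df.rdd.map { row =>
    // Assumes the factor columns are DoubleType; add casts if they are not.
    val s = model.score(name => row.getDouble(index(name)))
    Row.fromSeq(row.toSeq :+ s)
  }
  val schemaWithScore = StructType(df.schema.fields :+ StructField("score", DoubleType))
  sqlContext.createDataFrame(scored, schemaWithScore)
}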


Code generation for GPU

2015-09-07 Thread lonikar
Hi,

I am speaking at the Spark Europe Summit on exploiting GPUs for columnar
DataFrame operations. I have been going through the various blogs, talks and
JIRAs from all of you, trying to figure out where to make changes for this
proposal.

First of all, I must say that the recent progress in Project Tungsten has made
my job easier. The code generation changes make it possible for me to generate
OpenCL code for expressions instead of the existing Java/Scala code, and to run
that OpenCL code on GPUs through the Java library JavaCL.

However, before starting the work, I have a few questions/doubts as below:

   * I found where the code generation happens in the Spark code from the blogs
https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html,
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
and
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.
However, I could not find where the generated code is executed. A major part
of my changes will be there, since this executor will now have to send
column vectors to GPU RAM, invoke execution, and get the results back to
CPU RAM. Thus, the existing executor will change significantly.
   * On the Project Tungsten blog, in the third section, Code Generation, you
mention that you plan to increase the level of code generation from
record-at-a-time expression evaluation to vectorized expression evaluation.
Has this been implemented? If not, how do I implement this? I will need
access to the columnar ByteBuffer objects in the DataFrame to do this. Having
row-by-row access to the data would defeat this exercise. In particular, I
need access to
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
in the executor of the generated code.
   * One thing that confuses me is the change from 1.4 to 1.5, possibly due
to JIRA https://issues.apache.org/jira/browse/SPARK-7956 and pull request
https://github.com/apache/spark/pull/6479/files. This changed the code
generation from quasiquotes (q) to the string interpolation (s) operator. This
makes it simpler for me to generate OpenCL code, which is string based. The
question is: is this branch stable now? Should I make my changes on Spark 1.4,
Spark 1.5 or the master branch?
How do I tune the batch size (number of rows in the ByteBuffer)? Is it
through the property spark.sql.inMemoryColumnarStorage.batchSize?
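
For what it's worth, a minimal sketch of setting that property is below. It
controls how many rows go into each column batch when a DataFrame is cached in
the in-memory columnar format (10000 by default, if I recall correctly);
whether it is also the right knob for the GPU batching is an assumption you
would have to verify in the columnar code itself.

// In spark-shell (which already provides sqlContext), or after creating an SQLContext:
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "100000")

// Equivalently on the command line:
//   spark-submit --conf spark.sql.inMemoryColumnarStorage.batchSize=100000 ...

// The setting applies when a DataFrame is cached in the in-memory columnar format:
val df = sqlContext.range(0L, 1000000L)
df.cache().count()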

Thanks in anticipation,

Kiran
PS:

Other things I found useful were:

Spark DataFrames: https://www.brighttalk.com/webcast/12891/166495
Apache Spark 1.5: https://www.brighttalk.com/webcast/12891/168177

The links to JavaCL/ScalaCL:

Library to execute OpenCL code through Java:
https://github.com/nativelibs4java/JavaCL
Library to convert Scala code to OpenCL and execute on GPUs:
https://github.com/nativelibs4java/ScalaCL








Re: Python Spark Streaming example with textFileStream does not work. Why?

2015-09-07 Thread Kamil Khadiev
I think the problem also depends on the file system.
I use a Mac, and my program only finds a file when I create a new one, not
when I rename or move an existing one.

In the logs I can see that my file was found:
15/09/07 10:44:52 INFO FileInputDStream: New files at time 1441611892000 ms:

But I don't see any processing of the file in the logs.

2015-09-07 8:44 GMT+03:00 Kamil Khadiev :

> Thank you.
>
> But it still does not work.
>
> Also, I made another mistake: I had given the name of a file, not a directory.
>
> I fixed it:
>   conf = (SparkConf()
>  .setMaster("local")
>  .setAppName("My app")
>  .set("spark.executor.memory", "1g"))
> sc = SparkContext(conf = conf)
> ssc = StreamingContext(sc, 1)
> lines = ssc.textFileStream('../inputs/streaminginputs')
> counts = lines.flatMap(lambda line: line.split(" "))\
>   .map(lambda x: (x, 1))\
>   .reduceByKey(lambda a, b: a+b)
> counts.pprint()
> ssc.start()
> ssc.awaitTermination()
>
> I add a file to the '../inputs/streaminginputs' directory, then rename it; I
> also tried copying in a new one.
> But it does not help. I see the same situation in the console.
> Also, I get logs like this every second (but not the expected logs about a
> new file):
>
> ---
> Time: 2015-09-07 08:39:29
> ---
>
> 15/09/07 08:39:30 INFO FileInputDStream: Finding new files took 0 ms
> 15/09/07 08:39:30 INFO FileInputDStream: New files at time 144160437
> ms:
>
> 15/09/07 08:39:30 INFO JobScheduler: Added jobs for time 144160437 ms
> 15/09/07 08:39:30 INFO JobScheduler: Starting job streaming job
> 144160437 ms.0 from job set of time 144160437 ms
> 15/09/07 08:39:30 INFO SparkContext: Starting job: runJob at
> PythonRDD.scala:362
> 15/09/07 08:39:30 INFO DAGScheduler: Registering RDD 163 (call at
> /Library/Python/2.7/site-packages/py4j/java_gateway.py:1206)
> 15/09/07 08:39:30 INFO DAGScheduler: Got job 20 (runJob at
> PythonRDD.scala:362) with 1 output partitions (allowLocal=true)
> 15/09/07 08:39:30 INFO DAGScheduler: Final stage: Stage 41(runJob at
> PythonRDD.scala:362)
> 15/09/07 08:39:30 INFO DAGScheduler: Parents of final stage: List(Stage 40)
> 15/09/07 08:39:30 INFO DAGScheduler: Missing parents: List()
> 15/09/07 08:39:30 INFO DAGScheduler: Submitting Stage 41 (PythonRDD[167]
> at RDD at PythonRDD.scala:43), which has no missing parents
> 15/09/07 08:39:30 INFO MemoryStore: ensureFreeSpace(5952) called with
> curMem=31088, maxMem=278019440
> 15/09/07 08:39:30 INFO MemoryStore: Block broadcast_20 stored as values in
> memory (estimated size 5.8 KB, free 265.1 MB)
> 15/09/07 08:39:30 INFO MemoryStore: ensureFreeSpace(4413) called with
> curMem=37040, maxMem=278019440
> 15/09/07 08:39:30 INFO MemoryStore: Block broadcast_20_piece0 stored as
> bytes in memory (estimated size 4.3 KB, free 265.1 MB)
> 15/09/07 08:39:30 INFO BlockManagerInfo: Added broadcast_20_piece0 in
> memory on localhost:57739 (size: 4.3 KB, free: 265.1 MB)
> 15/09/07 08:39:30 INFO BlockManagerMaster: Updated info of block
> broadcast_20_piece0
> 15/09/07 08:39:30 INFO SparkContext: Created broadcast 20 from broadcast
> at DAGScheduler.scala:839
> 15/09/07 08:39:30 INFO DAGScheduler: Submitting 1 missing tasks from Stage
> 41 (PythonRDD[167] at RDD at PythonRDD.scala:43)
> 15/09/07 08:39:30 INFO TaskSchedulerImpl: Adding task set 41.0 with 1 tasks
> 15/09/07 08:39:30 INFO TaskSetManager: Starting task 0.0 in stage 41.0
> (TID 20, localhost, PROCESS_LOCAL, 1056 bytes)
> 15/09/07 08:39:30 INFO Executor: Running task 0.0 in stage 41.0 (TID 20)
> 15/09/07 08:39:30 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty
> blocks out of 0 blocks
> 15/09/07 08:39:30 INFO ShuffleBlockFetcherIterator: Started 0 remote
> fetches in 0 ms
> 15/09/07 08:39:30 INFO PythonRDD: Times: total = 1, boot = -1005, init =
> 1006, finish = 0
> 15/09/07 08:39:30 INFO PythonRDD: Times: total = 2, boot = -1004, init =
> 1006, finish = 0
> 15/09/07 08:39:30 INFO Executor: Finished task 0.0 in stage 41.0 (TID 20).
> 932 bytes result sent to driver
> 15/09/07 08:39:30 INFO TaskSetManager: Finished task 0.0 in stage 41.0
> (TID 20) in 7 ms on localhost (1/1)
> 15/09/07 08:39:30 INFO TaskSchedulerImpl: Removed TaskSet 41.0, whose
> tasks have all completed, from pool
> 15/09/07 08:39:30 INFO DAGScheduler: Stage 41 (runJob at
> PythonRDD.scala:362) finished in 0.008 s
> 15/09/07 08:39:30 INFO DAGScheduler: Job 20 finished: runJob at
> PythonRDD.scala:362, took 0.015576 s
> 15/09/07 08:39:30 INFO JobScheduler: Finished job streaming job
> 144160437 ms.0 from job set of time 144160437 ms
>
> 2015-09-04 20:14 GMT+03:00 Davies Liu :
>
>> Spark Streaming only processes NEW files that appear after it has started,
>> so you should point it to a directory and copy the file into it after
>> starting.
>>
>> On Fri, Sep 4, 2015 at 5:15 AM, Kamilbek  wrote:
>> > I use spark 1.3.1 and Python 2.7
>> >
>> > It is 

Java UDFs in GROUP BY expressions

2015-09-07 Thread James Aley
Hi everyone,

I raised this JIRA ticket back in July:
https://issues.apache.org/jira/browse/SPARK-9435

The problem is that it seems Spark SQL doesn't recognise columns we
transform with a UDF when referenced in the GROUP BY clause. There's a
minimal reproduction Java file attached to illustrate the issue.

The equivalent code in Scala seems to work fine for me. Is anyone else
seeing this problem? For us, the attached code fails every time on Spark
1.4.1.


Thanks,

James


Exception when restoring spark streaming with batch RDD from checkpoint.

2015-09-07 Thread ZhengHanbin
Hi,

I am using Spark Streaming to join every RDD of a DStream with a standalone RDD
to generate a new DStream, as follows:

def joinWithBatchEvent(contentFeature: RDD[(String, String)],
   batchEvent: DStream[((String, String), (Long, Double, 
Double))]) = {
  batchEvent.map(event => {
(event._1._2, (event._1._1, event._2._1, event._2._2, event._2._3))
  }).transform(eventRDD => {
eventRDD.leftOuterJoin(contentFeature).map(result =>
  (result._2._1._1, (result._1, result._2._1._2, result._2._1._3, 
result._2._1._4, result._2._2))
)
  })
}

It works well when it starts from a new StreamingContext.
But if the StreamingContext is restored from a checkpoint, the exception below
is thrown and the graph cannot be set up.
Do you know how to solve this problem? Thanks very much!
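
One pattern that sometimes helps here, sketched below, is to avoid capturing
the standalone RDD in the DStream closure at all and instead pass a function
that rebuilds it from the live SparkContext inside transform(), so that nothing
tied to the old context ends up in the checkpointed operation. This is only a
sketch under that assumption (untested against this exact setup), not a
confirmed fix.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def joinWithBatchEvent(loadContentFeature: SparkContext => RDD[(String, String)],
                       batchEvent: DStream[((String, String), (Long, Double, Double))]) = {
  batchEvent.map { case ((a, key), (x, y, z)) =>
    (key, (a, x, y, z))
  }.transform { eventRDD =>
    // Rebuild the lookup RDD from the current SparkContext on every batch,
    // instead of serializing an RDD reference into the checkpointed closure.
    val contentFeature = loadContentFeature(eventRDD.sparkContext)
    eventRDD.leftOuterJoin(contentFeature).map { case (key, ((a, x, y, z), feature)) =>
      (a, (key, x, y, z, feature))
    }
  }
}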

15/09/07 14:07:18 INFO spark.SparkContext: Starting job: saveAsTextFiles at 
CFBModel.scala:49
15/09/07 14:07:18 INFO scheduler.DAGScheduler: Registering RDD 12 (repartition 
at EventComponent.scala:64)
15/09/07 14:07:18 INFO scheduler.DAGScheduler: Registering RDD 17 (flatMap at 
CFBModel.scala:25)
15/09/07 14:07:18 INFO scheduler.DAGScheduler: Registering RDD 20 (map at 
ContentFeature.scala:100)
15/09/07 14:07:18 WARN scheduler.DAGScheduler: Creating new stage failed due to 
exception - job: 1
java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.findEntryImpl(FlatHashTable.scala:123)
at 
scala.collection.mutable.FlatHashTable$class.containsEntry(FlatHashTable.scala:119)
at scala.collection.mutable.HashSet.containsEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.contains(HashSet.scala:58)
at scala.collection.GenSetLike$class.apply(GenSetLike.scala:43)
at scala.collection.mutable.AbstractSet.apply(Set.scala:45)
at 
org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:336)
at 
org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:355)
at 
org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:317)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:218)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:301)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:298)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298)
at 
org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310)
at 
org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:244)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:731)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/09/07 14:07:18 INFO scheduler.DAGScheduler: Job 1 failed: saveAsTextFiles at 
CFBModel.scala:49, took 0.016406 s
15/09/07 14:07:18 ERROR scheduler.JobScheduler: Error running job streaming job 
144160590 ms.0
java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.findEntryImpl(FlatHashTable.scala:123)
at 
scala.collection.mutable.FlatHashTable$class.containsEntry(FlatHashTable.scala:119)
at scala.collection.mutable.HashSet.containsEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.contains(HashSet.scala:58)
at scala.collection.GenSetLike$class.apply(GenSetLike.scala:43)
at scala.collection.mutable.AbstractSet.apply(Set.scala:45)
at 
org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:336)
at 
org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:355)
at 
org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:317)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:218)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:301)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:298)
 

Zeppelin + Spark on EMR

2015-09-07 Thread shahab
Hi,

I am trying to use Zeppelin to work with Spark on Amazon EMR. I used the
script provided by Anders (
https://gist.github.com/andershammar/224e1077021d0ea376dd) to set up
Zeppelin. Zeppelin can connect to Spark, but I get the following error when I
run the tutorials:

...FileNotFoundException: File
file:/home/hadoop/zeppelin/interpreter/spark/dep/zeppelin-spark-dependencies-0.6.0-incubating-SNAPSHOT.jar
does not exist

However, the above file does exist at that path on the master node.

I would appreciate it if anyone with experience could share how to set up
Zeppelin with EMR.

best,
/Shahab


Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
Hi all, I have been trying to send my application logs to a socket so that we
can feed them to Logstash and check the application logs there.

here is my log4j.property file

main.logger=RFA,SA

log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.Port=4560
log4j.appender.SA.RemoteHost=hadoop07.housing.com
log4j.appender.SA.ReconnectionDelay=1
log4j.appender.SA.Application=NM-${user.dir}
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.apache.hadoop=WARN


I am launching my spark job using below common on YARN-cluster mode

*spark-submit --name data-ingestion --master yarn-cluster --conf
spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
 --files
/usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
--conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
--class com.housing.spark.streaming.Binning
/usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*


*Can anybody please tell me why I am not getting the logs on the socket?*


*I followed the pages listed below, without success:*
http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop


OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Tóth
Hi,

When I execute the Spark ML Logistic Regression example in pyspark I run
into an OutOfMemory exception. I'm wondering if any of you experienced the
same or has a hint about how to fix this.

The interesting bit is that I only get the exception when I try to write
the result DataFrame into a file. If I only "print" any of the results, it
all works fine.

My Setup:
Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
nightly build)
Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
-DzincPort=3034

I'm using the default resource setup
15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
capability: )
15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
capability: )

The script I'm executing:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("pysparktest")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vector, Vectors

training = sc.parallelize((
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

training_df = training.toDF()

from pyspark.ml.classification import LogisticRegression

reg = LogisticRegression()

reg.setMaxIter(10).setRegParam(0.01)
model = reg.fit(training.toDF())

test = sc.parallelize((
  LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))

out_df = model.transform(test.toDF())

out_df.write.parquet("/tmp/logparquet")

And the command:
spark-submit --master yarn --deploy-mode cluster spark-ml.py

Thanks,
z


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Tóth
Aaand, the error! :)

Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"org.apache.hadoop.hdfs.PeerCache@4e000abf"
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2115718813-47"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "qtp2115718813-47"

Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"

Log Type: stdout

Log Upload Time: Mon Sep 07 09:03:01 -0400 2015

Log Length: 986

Traceback (most recent call last):
  File "spark-ml.py", line 33, in 
out_df.write.parquet("/tmp/logparquet")
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
line 422, in parquet
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
line 36, in deco
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError



On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth  wrote:

> Hi,
>
> When I execute the Spark ML Logisitc Regression example in pyspark I run
> into an OutOfMemory exception. I'm wondering if any of you experienced the
> same or has a hint about how to fix this.
>
> The interesting bit is that I only get the exception when I try to write
> the result DataFrame into a file. If I only "print" any of the results, it
> all works fine.
>
> My Setup:
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
> -DzincPort=3034
>
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
>
> The script I'm executing:
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("pysparktest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
>
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
>
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>   LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5
>
> training_df = training.toDF()
>
> from pyspark.ml.classification import LogisticRegression
>
> reg = LogisticRegression()
>
> reg.setMaxIter(10).setRegParam(0.01)
> model = reg.fit(training.toDF())
>
> test = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>   LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5
>
> out_df = model.transform(test.toDF())
>
> out_df.write.parquet("/tmp/logparquet")
>
> And the command:
> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>
> Thanks,
> z
>


Re: Drools and Spark Integration - Need Help

2015-09-07 Thread Akhil Das
How are you integrating it with Spark?

Thanks
Best Regards

On Fri, Sep 4, 2015 at 12:11 PM, Shiva moorthy 
wrote:

> Hi Team,
>
> I am able to integrate Drools with Apache Spark, but after integration my
> application runs slower.
> Could you please give some ideas about how Drools can be efficiently
> integrated with Spark?
> Appreciate your help.
>
> Thanks and Regards,
> Shiva
>
>
>
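
For reference, a common way to wire a rules engine into Spark is to build one
session per partition with mapPartitions rather than one per record, since
session construction is expensive. A rough Scala sketch using the Drools 6 KIE
API is below; the Event type and the "ksession-rules" session name are made up,
and this is only an illustration, not necessarily how Shiva's integration looks.

import org.apache.spark.rdd.RDD
import org.kie.api.KieServices

case class Event(id: Long, value: Double)   // made-up domain type

def applyRules(events: RDD[Event]): RDD[Event] = {
  events.mapPartitions { iter =>
    // Build the (expensive) knowledge session once per partition, not per record.
    val kieServices = KieServices.Factory.get()
    val kContainer  = kieServices.getKieClasspathContainer()
    val session     = kContainer.newKieSession("ksession-rules") // assumed session name
    val processed = iter.map { e =>
      session.insert(e)
      session.fireAllRules()
      e
    }.toList                      // force evaluation before disposing the session
    session.dispose()
    processed.iterator
  }
}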


Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-07 Thread Terry Hole
Xiangrui,

Do you have any idea how to make this work?

Thanks
- Terry
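
As the error message hints, adding a StringIndexer on the label column attaches
the nominal-attribute metadata (the number of classes) that
DecisionTreeClassifier is asking for. A rough Scala sketch along the lines of
the ML pipeline example is below; it is untested on 1.4.1, so treat it as a
sketch rather than a confirmed fix.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}

// Feature stages as in the ML pipeline example the thread is based on.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Indexing the raw label column attaches the class-count metadata the tree requires.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, dt))
val model = pipeline.fit(training.toDF())   // training: RDD[LabeledDocument] as in the thread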

Terry Hole 于2015年9月6日星期日 17:41写道:

> Sean
>
> Do you know how to tell decision tree that the "label" is a binary or set
> some attributes to dataframe to carry number of classes?
>
> Thanks!
> - Terry
>
> On Sun, Sep 6, 2015 at 5:23 PM, Sean Owen  wrote:
>
>> (Sean)
>> The error suggests that the type is not a binary or nominal attribute
>> though. I think that's the missing step. A double-valued column need
>> not be one of these attribute types.
>>
>> On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole 
>> wrote:
>> > Hi, Owen,
>> >
>> > The dataframe "training" is from a RDD of case class:
>> RDD[LabeledDocument],
>> > while the case class is defined as this:
>> > case class LabeledDocument(id: Long, text: String, label: Double)
>> >
>> > So there is already has the default "label" column with "double" type.
>> >
>> > I already tried to set the label column for decision tree as this:
>> > val lr = new
>> >
>> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
>> > It raised the same error.
>> >
>> > I also tried to change the "label" to "int" type, it also reported error
>> > like following stack, I have no idea how to make this work.
>> >
>> > java.lang.IllegalArgumentException: requirement failed: Column label
>> must be
>> > of type DoubleType but was actually IntegerType.
>> > at scala.Predef$.require(Predef.scala:233)
>> > at
>> >
>> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
>> > at
>> >
>> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
>> > at
>> >
>> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>> > at
>> > org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>> > at
>> >
>> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
>> > at
>> >
>> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
>> > at
>> >
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> > at
>> >
>> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>> > at
>> > scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
>> > at
>> org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
>> > at
>> > org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
>> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
>> > at
>> >
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>> > at
>> >
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
>> > at
>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
>> > at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
>> > at $iwC$$iwC$$iwC$$iwC.(:70)
>> > at $iwC$$iwC$$iwC.(:72)
>> > at $iwC$$iwC.(:74)
>> > at $iwC.(:76)
>> > at (:78)
>> > at .(:82)
>> > at .()
>> > at .(:7)
>> > at .()
>> > at $print()
>> >
>> > Thanks!
>> > - Terry
>> >
>> > On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen  wrote:
>> >>
>> >> I think somewhere alone the line you've not specified your label
>> >> column -- it's defaulting to "label" and it does not recognize it, or
>> >> at least not as a binary or nominal attribute.
>> >>
>> >> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole 
>> wrote:
>> >> > Hi, Experts,
>> >> >
>> >> > I followed the guide of spark ml pipe to test DecisionTreeClassifier
>> on
>> >> > spark shell with spark 1.4.1, but always meets error like following,
>> do
>> >> > you
>> >> > have any idea how to fix this?
>> >> >
>> >> > The error stack:
>> >> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
>> >> > input
>> >> > with invalid label column label, without the number of classes
>> >> > specified.
>> >> > See StringIndexer.
>> >> > at
>> >> >
>> >> >
>> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
>> >> > at
>> >> >
>> >> >
>> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
>> >> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>> >> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>> >> > at
>> >> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
>> >> > at
>> >> > 

Re: Access a Broadcast variable causes Spark to launch a second context

2015-09-07 Thread sstraub
Never mind, all of this happened because somewhere in my code I wrote `def`
instead of `val`, which caused `collectAsMap` to be executed on each call.
Not sure why Spark at some point decided to create a new context, though...

Anyway, sorry for the disturbance.
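
For anyone hitting the same thing, a minimal sketch of the def-vs-val pitfall,
using the names from the quoted snippet below:

// With `def`, the right-hand side runs again on every reference, so collectAsMap()
// and the broadcast are redone each time; with `val`, it runs exactly once.
def lookupDef = sc.broadcast(someMap.collectAsMap())   // re-evaluated on every use
val lookupVal = sc.broadcast(someMap.collectAsMap())   // evaluated once, reused

someRDD.map(s => lookupVal.value.getOrElse(s, 0.0))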



sstraub wrote
> Hi,
> 
> I'm working on a spark job that frequently iterates over huge RDDs and
> matches the elements against some Maps that easily fit into memory. So
> what I do is to broadcast that Map and reference it from my RDD.
> 
> Works like a charm, until at some point it doesn't, and I can't figure out
> why...
> Please have a look at this:
> 
>   def fun(sc: SparkContext, someRDD: RDD[(String)], someMap: RDD[(String,
> Double)]) = {
> // I want to access the Map multiple times, so I broadcast it
> val broadcast = sc.broadcast(someMap.collectAsMap())
> // the next line creates one job per element and executes
> collectAsMap() over and over again
> println(someRDD.take(100).map(s => broadcast.value.getOrElse(s,
> 0.0)).toList.mkString("\n"))
> // the next line creates a new spark context and crashes (only one
> spark context per JVM...)
> println(someRDD.map(s => broadcast.value.getOrElse(s,
> 0.0)).collect().mkString("\n"))
>   }
> 
> Here I'm doing just what I've described above: broadcast a Map and access
> the broadcast value while iterating over another RDD.
> 
> Now when I take a subset of the RDD (`take(100)`), Spark creates one job
> per ELEMENT (that's 100 jobs) where `collectAsMap` is called. Obviously,
> this takes quite a lot of time (~500 ms per element).
> When I actually want to map over the entire RDD, Spark tries to launch
> another Spark context and crashes the whole application.
> 
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 2 in stage 37.0 failed 1 times, most recent failure: Lost task 2.0 in
> stage 37.0 (TID 106, localhost): org.apache.spark.SparkException: Only one
> SparkContext may be running in this JVM (see SPARK-2243). To ignore this
> error, set spark.driver.allowMultipleContexts = true.
> 
> I couldn't reproduce this error in a minimal working example, so there
> must be something in my pipeline that is messing things up. The error is
> 100% reproducible in my environment and the application runs fine as soon
> as I don't access this specific Map from this specific RDD.
> 
> Any idea what might cause this problem?
> Can I provide you with any other Information (besides posting >500 lines
> of code)?
> 
> cheers
> Sebastian








Re: Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
I also tried placing my customized log4j.properties file under
src/main/resources, but still no luck.

Won't the above step modify the default YARN and Spark log4j.properties?

Anyhow, it is still taking the log4j.properties from YARN.



On 7 September 2015 at 19:25, Jeetendra Gangele 
wrote:

> anybody here to help?
>
>
>
> On 7 September 2015 at 17:53, Jeetendra Gangele 
> wrote:
>
>> Hi All I have been trying to send my application related logs to socket
>> so that we can write log stash and check the application logs.
>>
>> here is my log4j.property file
>>
>> main.logger=RFA,SA
>>
>> log4j.appender.SA=org.apache.log4j.net.SocketAppender
>> log4j.appender.SA.Port=4560
>> log4j.appender.SA.RemoteHost=hadoop07.housing.com
>> log4j.appender.SA.ReconnectionDelay=1
>> log4j.appender.SA.Application=NM-${user.dir}
>> # Ignore messages below warning level from Jetty, because it's a bit
>> verbose
>> log4j.logger.org.spark-project.jetty=WARN
>> log4j.logger.org.apache.hadoop=WARN
>>
>>
>> I am launching my spark job using below common on YARN-cluster mode
>>
>> *spark-submit --name data-ingestion --master yarn-cluster --conf
>> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
>>  --files
>> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>> --conf
>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>> --class com.housing.spark.streaming.Binning
>> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>>
>>
>> *Can anybody please guide me why i am not getting the logs the socket?*
>>
>>
>> *I followed many pages listing below without success*
>>
>> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>>
>> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>>
>> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>>
>>
>


Re: Spark SQL - UDF for scoring a model - take $"*"

2015-09-07 Thread Jörn Franke
Can you use a map or list with the different properties as one parameter?
Alternatively, a string where the parameters are comma-separated...

Le lun. 7 sept. 2015 à 8:35, Night Wolf  a écrit :

> Is it possible to have a UDF which takes a variable number of arguments?
>
> e.g. df.select(myUdf($"*")) fails with
>
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project
> [scalaUDF(*) AS scalaUDF(*)#26];
>
> What I would like to do is pass in a generic data frame which can be then
> passed to a UDF which does scoring of a model. The UDF needs to know the
> schema to map column names in the model to columns in the DataFrame.
>
> The model has 100s of factors (very wide), so I can't just have a scoring
> UDF that has 500 parameters (for obvious reasons).
>
> Cheers,
> ~N
>


Re: spark-shell does not see conf folder content on emr-4

2015-09-07 Thread Akhil Das
You can also create a link to /etc/spark/conf from /usr/lib/spark/

Thanks
Best Regards

On Fri, Sep 4, 2015 at 2:40 AM, Alexander Pivovarov 
wrote:

> Hi Everyone
>
> My question is specific to running spark-1.4.1 on emr-4.0.0
>
> spark installed to /usr/lib/spark
> conf folder linked to /etc/spark/conf
> spark-shell location /usr/bin/spark-shell
>
> I noticed that if I run spark-shell it does not read /etc/spark/conf
> folder files (e.g. spark-env.sh and log4j configuration)
>
> To solve the problem I have to add /etc/spark/conf to SPARK_CLASSPATH
> export SPARK_CLASSPATH=/etc/spark/conf
>
> How to configure spark/emr4 to avoid manual step of adding /etc/spark/conf
> to SPARK_CLASSPATH?
>
> Alex
>


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Zvara
Hey, I'd try to debug, profile ResolvedDataSource. As far as I know, your
write will be performed by the JVM.

On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán  wrote:

> Unfortunately I'm getting the same error:
> The other interesting things are that:
>  - the parquet files got actually written to HDFS (also with
> .write.parquet() )
>  - the application gets stuck in the RUNNING state for good even after the
> error is thrown
>
> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
> 15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
> Exception in thread "Thread-7"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Thread-7"
> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread 
> "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception in thread "Reporter"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Reporter"
> Exception in thread "qtp2134582502-46"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "qtp2134582502-46"
>
>
>
>
> On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:
>
>> Hi,
>>
>> Can you try to using save method instead of write?
>>
>> ex: out_df.save("path","parquet")
>>
>> b0c1
>>
>>
>> --
>> Skype: boci13, Hangout: boci.b...@gmail.com
>>
>> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth 
>> wrote:
>>
>>> Aaand, the error! :)
>>>
>>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread 
>>> "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>>> Exception in thread "Thread-7"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "Thread-7"
>>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread 
>>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>>> Exception in thread "Reporter"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "Reporter"
>>> Exception in thread "qtp2115718813-47"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "qtp2115718813-47"
>>>
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>>>
>>> Log Type: stdout
>>>
>>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>>>
>>> Log Length: 986
>>>
>>> Traceback (most recent call last):
>>>   File "spark-ml.py", line 33, in 
>>> out_df.write.parquet("/tmp/logparquet")
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>>>  line 422, in parquet
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>>  line 538, in __call__
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>>>  line 36, in deco
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>>  line 300, in get_return_value
>>> py4j.protocol.Py4JJavaError
>>>
>>>
>>>
>>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
>>> wrote:
>>>
 Hi,

 When I execute the Spark ML Logisitc Regression example in pyspark I
 run into an OutOfMemory exception. I'm wondering if any of you experienced
 the same or has a hint about how to fix this.

 The interesting bit is that I only get the exception when I try to
 write the result DataFrame into a file. If I only "print" any of the
 results, it all works fine.

 My Setup:
 Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the
 latest nightly build)
 Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
 -DzincPort=3034

 I'm using the default resource setup
 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread boci
Hi,

Can you try using the save method instead of write?

ex: out_df.save("path","parquet")

b0c1

--
Skype: boci13, Hangout: boci.b...@gmail.com

On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth  wrote:

> Aaand, the error! :)
>
> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
> Exception in thread "Thread-7"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Thread-7"
> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread 
> "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception in thread "Reporter"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Reporter"
> Exception in thread "qtp2115718813-47"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "qtp2115718813-47"
>
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>
> Log Type: stdout
>
> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>
> Log Length: 986
>
> Traceback (most recent call last):
>   File "spark-ml.py", line 33, in 
> out_df.write.parquet("/tmp/logparquet")
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>  line 422, in parquet
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>  line 36, in deco
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError
>
>
>
> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth  wrote:
>
>> Hi,
>>
>> When I execute the Spark ML Logisitc Regression example in pyspark I run
>> into an OutOfMemory exception. I'm wondering if any of you experienced the
>> same or has a hint about how to fix this.
>>
>> The interesting bit is that I only get the exception when I try to write
>> the result DataFrame into a file. If I only "print" any of the results, it
>> all works fine.
>>
>> My Setup:
>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
>> nightly build)
>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
>> -DzincPort=3034
>>
>> I'm using the default resource setup
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
>> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>> capability: )
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>> capability: )
>>
>> The script I'm executing:
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SQLContext
>>
>> conf = SparkConf().setAppName("pysparktest")
>> sc = SparkContext(conf=conf)
>> sqlContext = SQLContext(sc)
>>
>> from pyspark.mllib.regression import LabeledPoint
>> from pyspark.mllib.linalg import Vector, Vectors
>>
>> training = sc.parallelize((
>>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>>   LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>>   LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5
>>
>> training_df = training.toDF()
>>
>> from pyspark.ml.classification import LogisticRegression
>>
>> reg = LogisticRegression()
>>
>> reg.setMaxIter(10).setRegParam(0.01)
>> model = reg.fit(training.toDF())
>>
>> test = sc.parallelize((
>>   LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>>   LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>>   LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5
>>
>> out_df = model.transform(test.toDF())
>>
>> out_df.write.parquet("/tmp/logparquet")
>>
>> And the command:
>> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>>
>> Thanks,
>> z
>>
>
>


Access a Broadcast variable causes Spark to launch a second context

2015-09-07 Thread sstraub
Hi,

I'm working on a spark job that frequently iterates over huge RDDs and
matches the elements against some Maps that easily fit into memory. So what
I do is to broadcast that Map and reference it from my RDD.

Works like a charm, until at some point it doesn't, and I can't figure out
why...
Please have a look at this:

  def fun(sc: SparkContext, someRDD: RDD[(String)], someMap: RDD[(String,
Double)]) = {
// I want to access the Map multiple times, so I broadcast it
val broadcast = sc.broadcast(someMap.collectAsMap())
// the next line creates one job per element and executes collectAsMap()
over and over again
println(someRDD.take(100).map(s => broadcast.value.getOrElse(s,
0.0)).toList.mkString("\n"))
// the next line creates a new spark context and crashes (only one spark
context per JVM...)
println(someRDD.map(s => broadcast.value.getOrElse(s,
0.0)).collect().mkString("\n"))
  }

Here I'm doing just what I've described above: broadcast a Map and access
the broadcast value while iterating over another RDD.

Now when I take a subset of the RDD (`take(100)`), Spark creates one job per
ELEMENT (that's 100 jobs) where `collectAsMap` is called. Obviously, this
takes quite a lot of time (~500 ms per element).
When I actually want to map over the entire RDD, Spark tries to launch
another Spark context and crashes the whole application.

org.apache.spark.SparkException: Job aborted due to stage failure: Task
2 in stage 37.0 failed 1 times, most recent failure: Lost task 2.0 in stage
37.0 (TID 106, localhost): org.apache.spark.SparkException: Only one
SparkContext may be running in this JVM (see SPARK-2243). To ignore this
error, set spark.driver.allowMultipleContexts = true.

I couldn't reproduce this error in a minimal working example, so there must
be something in my pipeline that is messing things up. The error is 100%
reproducible in my environment and the application runs fine as soon as I
don't access this specific Map from this specific RDD.

Any idea what might cause this problem?
Can I provide you with any other Information (besides posting >500 lines of
code)?

cheers
Sebastian






Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Tóth Zoltán
Unfortunately I'm getting the same error. The other interesting things are that:
 - the parquet files actually do get written to HDFS (also with
.write.parquet() )
 - the application stays stuck in the RUNNING state for good even after the
error is thrown

15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2134582502-46"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "qtp2134582502-46"




On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:

> Hi,
>
> Can you try to using save method instead of write?
>
> ex: out_df.save("path","parquet")
>
> b0c1
>
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth  wrote:
>
>> Aaand, the error! :)
>>
>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception in thread "Thread-7"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Thread-7"
>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception in thread "Reporter"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Reporter"
>> Exception in thread "qtp2115718813-47"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "qtp2115718813-47"
>>
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>>
>> Log Type: stdout
>>
>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>>
>> Log Length: 986
>>
>> Traceback (most recent call last):
>>   File "spark-ml.py", line 33, in 
>> out_df.write.parquet("/tmp/logparquet")
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>>  line 422, in parquet
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>  line 538, in __call__
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>>  line 36, in deco
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>  line 300, in get_return_value
>> py4j.protocol.Py4JJavaError
>>
>>
>>
>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
>> wrote:
>>
>>> Hi,
>>>
>>> When I execute the Spark ML Logisitc Regression example in pyspark I run
>>> into an OutOfMemory exception. I'm wondering if any of you experienced the
>>> same or has a hint about how to fix this.
>>>
>>> The interesting bit is that I only get the exception when I try to write
>>> the result DataFrame into a file. If I only "print" any of the results, it
>>> all works fine.
>>>
>>> My Setup:
>>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
>>> nightly build)
>>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
>>> -DzincPort=3034
>>>
>>> I'm using the default resource setup
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
>>> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>>> capability: )
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>>> capability: )
>>>
>>> 

Shared data between algorithms

2015-09-07 Thread Somabha Bhattacharjya
Hello All,

I plan to run two clustering algorithms on shared data in Spark MLlib (Algo A
starts first and modifies the data, and then Algo B starts with the modified
part; thereafter they run in parallel). Is it possible to share data between
two algorithms in a single pipeline?

Regards,
Somabha Bhattacharjya
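
One common way to do this, sketched below in Scala, is to materialize the
intermediate data once (cache/persist) and hand the same RDD to both
algorithms; they can then be kicked off from separate driver threads if they
really need to run concurrently. The algoA preprocessing function stands in for
whatever Algo A does and is made up for illustration.

import org.apache.spark.mllib.clustering.{GaussianMixture, KMeans}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def run(points: RDD[Vector], algoA: RDD[Vector] => RDD[Vector]): Unit = {
  // Materialize Algo A's output once; both downstream algorithms reuse the cached RDD.
  val shared = algoA(points).persist(StorageLevel.MEMORY_AND_DISK)
  shared.count()   // force materialization

  // Both consumers read the same cached data.
  val kmeansModel = KMeans.train(shared, 10, 20)
  val gmmModel    = new GaussianMixture().setK(10).run(shared)
}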


Re: Sending yarn application logs to web socket

2015-09-07 Thread Jeetendra Gangele
anybody here to help?



On 7 September 2015 at 17:53, Jeetendra Gangele 
wrote:

> Hi All I have been trying to send my application related logs to socket so
> that we can write log stash and check the application logs.
>
> here is my log4j.property file
>
> main.logger=RFA,SA
>
> log4j.appender.SA=org.apache.log4j.net.SocketAppender
> log4j.appender.SA.Port=4560
> log4j.appender.SA.RemoteHost=hadoop07.housing.com
> log4j.appender.SA.ReconnectionDelay=1
> log4j.appender.SA.Application=NM-${user.dir}
> # Ignore messages below warning level from Jetty, because it's a bit
> verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.apache.hadoop=WARN
>
>
> I am launching my spark job using the command below in YARN-cluster mode
>
> *spark-submit --name data-ingestion --master yarn-cluster --conf
> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
>  --files
> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
> --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
> --class com.housing.spark.streaming.Binning
> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>
>
> *Can anybody please guide me why I am not getting the logs on the socket?*
>
>
> *I followed the pages listed below without success*
>
> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>
> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>
> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>
>


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zsolt Tóth
Hi,

I ran your example on Spark-1.4.1 and 1.5.0-rc3. It succeeds on 1.4.1 but
throws the OOM on 1.5.0. Do any of you know which PR introduced this
issue?

Zsolt
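
(Not an answer to which PR introduced it, but as a hedged workaround while the
regression is tracked down: the quoted run uses fairly small default containers,
so giving the driver/AM and executors more headroom may at least unblock the
parquet write. The flags are standard spark-submit options; the sizes below are
only examples.)

spark-submit --master yarn-cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  spark-ml.py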


2015-09-07 16:33 GMT+02:00 Zoltán Zvara :

> Hey, I'd try to debug, profile ResolvedDataSource. As far as I know, your
> write will be performed by the JVM.
>
> On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán  wrote:
>
>> Unfortunately I'm getting the same error:
>> The other interesting things are that:
>>  - the parquet files got actually written to HDFS (also with
>> .write.parquet() )
>>  - the application gets stuck in the RUNNING state for good even after
>> the error is thrown
>>
>> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
>> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
>> 15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
>> Exception in thread "Thread-7"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Thread-7"
>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "org.apache.hadoop.hdfs.PeerCache@4070d501"
>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception in thread "Reporter"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Reporter"
>> Exception in thread "qtp2134582502-46"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "qtp2134582502-46"
>>
>>
>>
>>
>> On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:
>>
>>> Hi,
>>>
>>> Can you try to using save method instead of write?
>>>
>>> ex: out_df.save("path","parquet")
>>>
>>> b0c1
>>>
>>>
>>> --
>>> Skype: boci13, Hangout: boci.b...@gmail.com
>>>
>>> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth 
>>> wrote:
>>>
 Aaand, the error! :)

 Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread 
 "org.apache.hadoop.hdfs.PeerCache@4e000abf"
 Exception in thread "Thread-7"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "Thread-7"
 Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread 
 "LeaseRenewer:r...@docker.rapidminer.com:8020"
 Exception in thread "Reporter"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "Reporter"
 Exception in thread "qtp2115718813-47"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "qtp2115718813-47"

 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"

 Log Type: stdout

 Log Upload Time: Mon Sep 07 09:03:01 -0400 2015

 Log Length: 986

 Traceback (most recent call last):
   File "spark-ml.py", line 33, in 
 out_df.write.parquet("/tmp/logparquet")
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
  line 422, in parquet
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
  line 538, in __call__
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
  line 36, in deco
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError



 On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
 wrote:

> Hi,
>
> When I execute the Spark ML Logistic Regression example in pyspark I
> run into an OutOfMemory exception. I'm wondering if any of you experienced
> the same or has a hint about how to fix this.
>
> The interesting bit is that I only get the exception when I try to
> write the result DataFrame into a file. If I only "print" any of the
> 

Re: Spark on Yarn vs Standalone

2015-09-07 Thread Sandy Ryza
Hi Alex,

If they're both configured correctly, there's no reason that Spark
Standalone should provide performance or memory improvement over Spark on
YARN.

-Sandy

On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov 
wrote:

> Hi Everyone
>
> We are trying the latest aws emr-4.0.0 and Spark and my question is about
> YARN vs Standalone mode.
> Our usecase is
> - start 100-150 nodes cluster every week,
> - run one heavy spark job (5-6 hours)
> - save data to s3
> - stop cluster
>
> Officially aws emr-4.0.0 comes with Spark on Yarn
> It's probably possible to hack emr by creating a bootstrap script which
> stops YARN and starts the master and slaves on each computer (to start Spark
> in standalone mode).
>
> My questions are
> - Does Spark standalone provide a significant performance / memory
> improvement in comparison to YARN mode?
> - Is it worth hacking the official emr Spark on Yarn and switching Spark to
> Standalone mode?
>
>
> I already created comparison table and want you to check if my
> understanding is correct
>
> Lets say r3.2xlarge computer has 52GB ram available for Spark Executor JVMs
>
> Standalone-to-YARN comparison:
>
>                                                               STANDALONE | YARN
> can executor allocate up to 52GB RAM                          yes        | yes
> will executor be unresponsive after using all 52GB RAM (GC)   yes        | yes
> additional JVMs on the slave besides the Spark executor       worker     | node manager
> are additional JVMs lightweight                               yes        | yes
>
>
> Thank you
>
> Alex
>
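
To make the memory column concrete, a rough sketch of how the 52GB per slave
might be requested in each mode. The sizes and the overhead setting are
illustrative only; on YARN the executor heap plus
spark.yarn.executor.memoryOverhead has to fit inside the container, and
heavy_job.py is just a placeholder for the actual job:

# Spark on YARN (the emr-4.0.0 default)
spark-submit --master yarn-cluster \
  --executor-memory 47g \
  --conf spark.yarn.executor.memoryOverhead=5120 \
  heavy_job.py

# Spark standalone (after a bootstrap action starts master/workers)
# in spark-env.sh on each slave:
export SPARK_WORKER_MEMORY=52g
# then:
spark-submit --master spark://master-host:7077 --executor-memory 52g heavy_job.py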


Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html

The implementation seems to be missing backpropagation?
Was there a good reason to omit BP?
What are the drawbacks of a pure feedforward-only ANN?

Thanks!


-- 
Ruslan Dautkhanov


Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-07 Thread Nicholas R. Peterson
I'm trying to run a Spark 1.4.1 job on my CDH5.4 cluster, through Yarn.
Serialization is set to use Kryo.

I have a large object which I send to the executors as a Broadcast. The
object seems to serialize just fine. When it attempts to deserialize,
though, Kryo throws a ClassNotFoundException... for a class that I include
in the fat jar that I spark-submit.

What could be causing this classpath issue with Kryo on the executors?
Where should I even start looking to try to diagnose the problem? I
appreciate any help you can provide.

Thank you!

-- Nick
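
One hedged thing to try while narrowing it down: Kryo can resolve classes with
a different classloader than the one that loaded the application jar, so
registering the broadcast class explicitly (or temporarily switching back to
Java serialization) helps isolate whether Kryo's class resolution is the
problem. MyLargeObject below is a placeholder for the real broadcast class:

import org.apache.spark.{SparkConf, SparkContext}

case class MyLargeObject(data: Map[String, String])  // stand-in for the real class

val conf = new SparkConf()
  .setAppName("kryo-broadcast-check")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register the broadcast class (and anything it references) with Kryo
  .registerKryoClasses(Array(classOf[MyLargeObject]))

val sc = new SparkContext(conf)
val bc = sc.broadcast(MyLargeObject(Map("k" -> "v")))
// to rule Kryo in or out, comment out the two Kryo lines above and rerun with
// the default JavaSerializer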


Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
Thank you - it works if the file is created in Spark

On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov 
wrote:

> Read response from Cheng Lian  on Aug/27th - it
> looks the same problem.
>
> Workarounds
> 1. write that parquet file in Spark;
> 2. upgrade to Spark 1.5.
>
> --
> Ruslan Dautkhanov
>
> On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov  wrote:
>
>> No, it was created in Hive by CTAS, but any help is appreciated...
>>
>> On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov 
>> wrote:
>>
>>> That parquet table wasn't created in Spark, is it?
>>>
>>> There was a recent discussion on this list that complex data types in
>>> Spark prior to 1.5 often incompatible with Hive for example, if I remember
>>> correctly.
>>> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov  wrote:
>>>
 I am trying to read an (array typed) parquet file in spark-shell (Spark
 1.4.1 with Hadoop 2.6):

 {code}
 $ bin/spark-shell
 log4j:WARN No appenders could be found for logger
 (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
 for more info.
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to:
 hivedata
 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(hivedata);
 users with modify permissions: Set(hivedata)
 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
 server' on port 43731.
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
   /_/

 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.8.0)
 Type in expressions to have them evaluated.
 Type :help for more information.
 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to:
 hivedata
 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(hivedata);
 users with modify permissions: Set(hivedata)
 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
 15/09/07 13:45:27 INFO Remoting: Starting remoting
 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on
 addresses :[akka.tcp://sparkDriver@10.10.30.52:46083]
 15/09/07 13:45:27 INFO Utils: Successfully started service
 'sparkDriver' on port 46083.
 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
 /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
 265.1 MB
 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
 server' on port 38717.
 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
 4040. Attempting port 4041.
 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
 port 4041.
 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at
 http://10.10.30.52:4041
 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
 localhost
 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
 http://10.10.30.52:43731
 15/09/07 13:45:27 INFO Utils: Successfully started service
 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on
 60973
 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register
 BlockManager
 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
 manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
 localhost, 60973)
 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
 Spark context available as sc.
 15/09/07 

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
No, it was created in Hive by CTAS, but any help is appreciated...

On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov 
wrote:

> That parquet table wasn't created in Spark, is it?
>
> There was a recent discussion on this list that complex data types in
> Spark prior to 1.5 often incompatible with Hive for example, if I remember
> correctly.
> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov  wrote:
>
>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>> 1.4.1 with Hadoop 2.6):
>>
>> {code}
>> $ bin/spark-shell
>> log4j:WARN No appenders could be found for logger
>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>> log4j:WARN Please initialize the log4j system properly.
>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>> more info.
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>> users with modify permissions: Set(hivedata)
>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
>> server' on port 43731.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>>   /_/
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>> users with modify permissions: Set(hivedata)
>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
>> 15/09/07 13:45:27 INFO Remoting: Starting remoting
>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkDriver@10.10.30.52:46083]
>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
>> on port 46083.
>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
>> 265.1 MB
>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
>> server' on port 38717.
>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
>> port 4041.
>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at
>> http://10.10.30.52:4041
>> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
>> localhost
>> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
>> http://10.10.30.52:43731
>> 15/09/07 13:45:27 INFO Utils: Successfully started service
>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
>> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
>> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
>> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
>> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
>> localhost, 60973)
>> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
>> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
>> Spark context available as sc.
>> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
>> 0.13.1
>> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
>> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
>> 15/09/07 13:45:29 INFO Persistence: Property
>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>> 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
>> unknown - will be ignored
>> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
>> CLASSPATH (or one of dependencies)
>> 15/09/07 13:45:29 

Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Thanks!

It does not look like Spark ANN supports dropout/dropconnect or any other
techniques that help avoid overfitting yet?
http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
https://cs.nyu.edu/~wanli/dropc/dropc.pdf

ps. There is a small copy-paste typo in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
should read B :)



-- 
Ruslan Dautkhanov

On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
wrote:

> Backprop is used to compute the gradient here
> ,
> which is then optimized by SGD or LBFGS here
> 
>
> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath 
> wrote:
>
>> Haven't checked the actual code but that doc says "MLPC employes
>> backpropagation for learning the model. .."?
>>
>>
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov 
>> wrote:
>>
>>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>>
>>> Implementation seems missing backpropagation?
>>> Was there is a good reason to omit BP?
>>> What are the drawbacks of a pure feedforward-only ANN?
>>>
>>> Thanks!
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>
>>
>


Re: Spark ANN

2015-09-07 Thread Ruslan Dautkhanov
Found a dropout commit from avulanov:
https://github.com/avulanov/spark/commit/3f25e26d10ef8617e46e35953fe0ad1a178be69d

It probably hasn't made its way to MLLib (yet?).



-- 
Ruslan Dautkhanov

On Mon, Sep 7, 2015 at 8:34 PM, Feynman Liang  wrote:

> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on
> the roadmap for 1.6 
> though, and there is a spark package
>  for
> dropout regularized logistic regression.
>
>
> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov 
> wrote:
>
>> Thanks!
>>
>> It does not look Spark ANN yet supports dropout/dropconnect or any other
>> techniques that help avoiding overfitting?
>> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>>
>> ps. There is a small copy-paste typo in
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>> should read B :)
>>
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
>> wrote:
>>
>>> Backprop is used to compute the gradient here
>>> ,
>>> which is then optimized by SGD or LBFGS here
>>> 
>>>
>>> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 Haven't checked the actual code but that doc says "MLPC employes
 backpropagation for learning the model. .."?



 —
 Sent from Mailbox 


 On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov  wrote:

> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>
> Implementation seems missing backpropagation?
> Was there is a good reason to omit BP?
> What are the drawbacks of a pure feedforward-only ANN?
>
> Thanks!
>
>
> --
> Ruslan Dautkhanov
>


>>>
>>
>


Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
Read response from Cheng Lian  on Aug/27th - it
looks like the same problem.

Workarounds
1. write that parquet file in Spark;
2. upgrade to Spark 1.5.
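
For workaround 1, a minimal sketch of producing the array-typed parquet file
from Spark itself (as confirmed earlier in the thread, files written this way
read back fine in 1.4.1). The column names, values and output path are
placeholders:

// in spark-shell
val df = sqlContext.createDataFrame(Seq(
  (1, Seq("a", "b")),
  (2, Seq("c"))
)).toDF("id", "tags")

df.write.parquet("/tmp/stats_spark")
sqlContext.read.parquet("/tmp/stats_spark").show()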


--
Ruslan Dautkhanov

On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov  wrote:

> No, it was created in Hive by CTAS, but any help is appreciated...
>
> On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov 
> wrote:
>
>> That parquet table wasn't created in Spark, is it?
>>
>> There was a recent discussion on this list that complex data types in
>> Spark prior to 1.5 often incompatible with Hive for example, if I remember
>> correctly.
>> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov  wrote:
>>
>>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>>> 1.4.1 with Hadoop 2.6):
>>>
>>> {code}
>>> $ bin/spark-shell
>>> log4j:WARN No appenders could be found for logger
>>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>> log4j:WARN Please initialize the log4j system properly.
>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>>> for more info.
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
>>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>>> users with modify permissions: Set(hivedata)
>>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
>>> server' on port 43731.
>>> Welcome to
>>>     __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>>>   /_/
>>>
>>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.8.0)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
>>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
>>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
>>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>>> users with modify permissions: Set(hivedata)
>>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
>>> 15/09/07 13:45:27 INFO Remoting: Starting remoting
>>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://sparkDriver@10.10.30.52:46083]
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
>>> on port 46083.
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
>>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
>>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
>>> 265.1 MB
>>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
>>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
>>> server' on port 38717.
>>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
>>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
>>> 4040. Attempting port 4041.
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
>>> port 4041.
>>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at
>>> http://10.10.30.52:4041
>>> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
>>> localhost
>>> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
>>> http://10.10.30.52:43731
>>> 15/09/07 13:45:27 INFO Utils: Successfully started service
>>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
>>> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
>>> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register
>>> BlockManager
>>> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
>>> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
>>> localhost, 60973)
>>> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
>>> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
>>> Spark context available as sc.
>>> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
>>> 0.13.1
>>> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
>>> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>>> 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, 

Re: Sending yarn application logs to web socket

2015-09-07 Thread Yana Kadiyska
Hopefully someone will give you a more direct answer but whenever I'm
having issues with log4j I always try -Dlog4j.debug=true. This will tell you
which log4j settings are getting picked up from where. I've spent countless
hours due to typos in the file, for example.
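
Concretely, that flag can be appended to the existing submit command; everything
here is from the original command except -Dlog4j.debug=true (the custom
spark.custom.configuration.file conf is omitted for brevity):

spark-submit --name data-ingestion --master yarn-cluster \
  --files /usr/hdp/current/spark-client/Runnable/conf/log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Dlog4j.debug=true" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -Dlog4j.debug=true" \
  --class com.housing.spark.streaming.Binning \
  /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar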

On Mon, Sep 7, 2015 at 11:47 AM, Jeetendra Gangele 
wrote:

> I also tried placing my costomized log4j.properties file under
> src/main/resources still no luck.
>
> won't above step modify the default YARN and spark  log4j.properties  ?
>
> anyhow its still taking log4j.properties from YARn.
>
>
>
> On 7 September 2015 at 19:25, Jeetendra Gangele 
> wrote:
>
>> anybody here to help?
>>
>>
>>
>> On 7 September 2015 at 17:53, Jeetendra Gangele 
>> wrote:
>>
>>> Hi All I have been trying to send my application related logs to socket
>>> so that we can write log stash and check the application logs.
>>>
>>> here is my log4j.property file
>>>
>>> main.logger=RFA,SA
>>>
>>> log4j.appender.SA=org.apache.log4j.net.SocketAppender
>>> log4j.appender.SA.Port=4560
>>> log4j.appender.SA.RemoteHost=hadoop07.housing.com
>>> log4j.appender.SA.ReconnectionDelay=1
>>> log4j.appender.SA.Application=NM-${user.dir}
>>> # Ignore messages below warning level from Jetty, because it's a bit
>>> verbose
>>> log4j.logger.org.spark-project.jetty=WARN
>>> log4j.logger.org.apache.hadoop=WARN
>>>
>>>
>>> I am launching my spark job using below common on YARN-cluster mode
>>>
>>> *spark-submit --name data-ingestion --master yarn-cluster --conf
>>> spark.custom.configuration.file=hdfs://10.1.6.186/configuration/binning-dev.conf
>>>  --files
>>> /usr/hdp/current/spark-client/Runnable/conf/log4j.properties --conf
>>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>>> --conf
>>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
>>> --class com.housing.spark.streaming.Binning
>>> /usr/hdp/current/spark-client/Runnable/dsl-data-ingestion-all.jar*
>>>
>>>
>>> *Can anybody please guide me why i am not getting the logs the socket?*
>>>
>>>
>>> *I followed many pages listing below without success*
>>>
>>> http://tech-stories.com/2015/02/12/setting-up-a-central-logging-infrastructure-for-hadoop-and-spark/#comment-208
>>>
>>> http://stackoverflow.com/questions/22918720/custom-log4j-appender-in-hadoop-2
>>>
>>> http://stackoverflow.com/questions/9081625/override-log4j-properties-in-hadoop
>>>
>>>
>>
>
>
>
>


Re: hadoop2.6.0 + spark1.4.1 + python2.7.10

2015-09-07 Thread Ashish Dutt
Hello Sasha,

I have no answer for debian. My cluster is on Linux and I'm using CDH 5.4
Your question-  "Error from python worker:
  /cube/PY/Python27/bin/python: No module named pyspark"

On a single node (ie one server/machine/computer) I installed pyspark
binaries and it worked. Connected it to pycharm and it worked too.

Next I tried executing the pyspark command on another node (say, a worker) in
the cluster and I got this error message: "Error from python worker: PATH:
No module named pyspark".

My first guess was that the worker was not picking up the path of the pyspark
binaries installed on the server (I tried many things, like hard-coding the
pyspark path in the config.sh file on the worker - no luck; a dynamic path
from the code in PyCharm - no luck; searching the web and asking the question
in almost every online forum - no luck; banging my head against pyspark/hadoop
books - no luck. Finally, one fine day the 'watermelon' dropped while brooding
on this problem, and I installed the pyspark binaries on all the worker
machines). Now when I try executing just the pyspark command on the workers,
it works. Simple program snippets run on each worker too.

I am not sure if this will help or not for your use-case.
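
If installing the pyspark binaries on every node is not an option, another
thing that sometimes helps (an assumption on my part, not something verified on
Debian) is pointing the YARN containers explicitly at the Python interpreter
and at Spark's python/ libraries. The /opt/spark paths below are examples; the
interpreter path is the one from the error message:

export PYSPARK_PYTHON=/cube/PY/Python27/bin/python

spark-submit --master yarn-client \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/cube/PY/Python27/bin/python \
  --conf "spark.executorEnv.PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.8.2.1-src.zip" \
  PysparkPandas.py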



Sincerely,
Ashish

On Mon, Sep 7, 2015 at 11:04 PM, Sasha Kacanski  wrote:

> Thanks Ashish,
> nice blog but does not cover my issue. Actually I have pycharm running and
> loading pyspark and rest of libraries perfectly fine.
> My issue is that I am not sure what is triggering
>
> Error from python worker:
>   /cube/PY/Python27/bin/python: No module named pyspark
> pyspark
> PYTHONPATH was:
>
> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.
> 4.1-hadoop2.6.0.jar
>
> The question is: why is YARN not getting the python package needed to run on
> the single node?
> Some people say to run with Java 6 due to zip library changes between
> 6/7/8, some identified a bug with RHEL (I am on Debian), and some point to
> documentation errors, but nothing is really clear.
>
> I have binaries for Spark and Hadoop, and I did just fine with the Spark SQL
> module, Hive, Python, pandas and YARN.
> Locally, as I said, the app works fine (pandas to Spark DF to parquet),
> but as soon as I move to yarn-client mode, YARN is not getting the packages
> required to run the app.
>
> If someone confirms that I need to build everything from source with a
> specific version of the software I will do that, but at this point I am not
> sure what to do to remedy this situation...
>
> --sasha
>
>
> On Sun, Sep 6, 2015 at 8:27 PM, Ashish Dutt 
> wrote:
>
>> Hi Aleksandar,
>> Quite some time ago, I faced the same problem and I found a solution
>> which I have posted here on my blog
>> .
>> See if that can help you and if it does not then you can check out these
>> questions & solution on stackoverflow
>>  website
>>
>>
>> Sincerely,
>> Ashish Dutt
>>
>>
>> On Mon, Sep 7, 2015 at 7:17 AM, Sasha Kacanski 
>> wrote:
>>
>>> Hi,
>>> I am successfully running python app via pyCharm in local mode
>>> setMaster("local[*]")
>>>
>>> When I turn on SparkConf().setMaster("yarn-client")
>>>
>>> and run via
>>>
>>> park-submit PysparkPandas.py
>>>
>>>
>>> I run into issue:
>>> Error from python worker:
>>>   /cube/PY/Python27/bin/python: No module named pyspark
>>> PYTHONPATH was:
>>>
>>> /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/filecache/18/spark-assembly-1.4.1-hadoop2.6.0.jar
>>>
>>> I am running java
>>> hadoop@pluto:~/pySpark$ /opt/java/jdk/bin/java -version
>>> java version "1.8.0_31"
>>> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
>>>
>>> Should I try same thing with java 6/7
>>>
>>> Is this packaging issue or I have something wrong with configurations ...
>>>
>>> Regards,
>>>
>>> --
>>> Aleksandar Kacanski
>>>
>>
>>
>
>
> --
> Aleksandar Kacanski
>


Re: Spark ANN

2015-09-07 Thread Feynman Liang
Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on
the roadmap for 1.6 
though, and there is a spark package
 for
dropout regularized logistic regression.
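
For reference, a minimal sketch of what the current (1.5) feedforward MLPC API
looks like, along the lines of the documentation example; the layer sizes, seed
and sample-data path are the usual placeholders:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

val data = MLUtils.loadLibSVMFile(sc,
  "data/mllib/sample_multiclass_classification_data.txt").toDF()
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)

// 4 input features, two hidden layers, 3 output classes
val layers = Array[Int](4, 5, 4, 3)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)  // gradients come from backprop, optimized by SGD or LBFGS per the reply above

val model = trainer.fit(train)
val result = model.transform(test)
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision: " + evaluator.evaluate(result.select("prediction", "label")))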


On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov 
wrote:

> Thanks!
>
> It does not look Spark ANN yet supports dropout/dropconnect or any other
> techniques that help avoiding overfitting?
> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>
> ps. There is a small copy-paste typo in
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
> should read B :)
>
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
> wrote:
>
>> Backprop is used to compute the gradient here
>> ,
>> which is then optimized by SGD or LBFGS here
>> 
>>
>> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath > > wrote:
>>
>>> Haven't checked the actual code but that doc says "MLPC employes
>>> backpropagation for learning the model. .."?
>>>
>>>
>>>
>>> —
>>> Sent from Mailbox 
>>>
>>>
>>> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov 
>>> wrote:
>>>
 http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html

 Implementation seems missing backpropagation?
 Was there is a good reason to omit BP?
 What are the drawbacks of a pure feedforward-only ANN?

 Thanks!


 --
 Ruslan Dautkhanov

>>>
>>>
>>
>


Re: Spark ANN

2015-09-07 Thread Feynman Liang
BTW thanks for pointing out the typos, I've included them in my MLP cleanup
PR 

On Mon, Sep 7, 2015 at 7:34 PM, Feynman Liang  wrote:

> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is on
> the roadmap for 1.6 
> though, and there is a spark package
>  for
> dropout regularized logistic regression.
>
>
> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov 
> wrote:
>
>> Thanks!
>>
>> It does not look Spark ANN yet supports dropout/dropconnect or any other
>> techniques that help avoiding overfitting?
>> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>>
>> ps. There is a small copy-paste typo in
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>> should read B :)
>>
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
>> wrote:
>>
>>> Backprop is used to compute the gradient here
>>> ,
>>> which is then optimized by SGD or LBFGS here
>>> 
>>>
>>> On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 Haven't checked the actual code but that doc says "MLPC employes
 backpropagation for learning the model. .."?



 —
 Sent from Mailbox 


 On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov  wrote:

> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>
> Implementation seems missing backpropagation?
> Was there is a good reason to omit BP?
> What are the drawbacks of a pure feedforward-only ANN?
>
> Thanks!
>
>
> --
> Ruslan Dautkhanov
>


>>>
>>
>


Re: Spark ANN

2015-09-07 Thread Debasish Das
Not sure about dropout, but if you change the solver from breeze BFGS to breeze
OWLQN or breeze.proximal.NonlinearMinimizer you can solve the ANN loss with L1
regularization, which will yield elastic-net-style sparse solutions. Using that
you can clean up edges which have 0.0 as weight...
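
(The sketch below is not the ANN solver swap itself; it just demonstrates the
L1 sparsity effect described above using ml.LogisticRegression, which already
wires breeze OWL-QN in when elasticNetParam/regParam select an L1 penalty. The
data path and parameter values are examples only.)

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

// elasticNetParam = 1.0 is a pure L1 penalty; many weights come out exactly
// 0.0 and can be pruned, which is the "clean up zero-weight edges" idea.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.1)
  .setElasticNetParam(1.0)

val model = lr.fit(training)
// `weights` is the 1.4/1.5 field name for the coefficient vector
println("non-zero weights: " + model.weights.toArray.count(_ != 0.0))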
On Sep 7, 2015 7:35 PM, "Feynman Liang"  wrote:

> BTW thanks for pointing out the typos, I've included them in my MLP
> cleanup PR 
>
> On Mon, Sep 7, 2015 at 7:34 PM, Feynman Liang 
> wrote:
>
>> Unfortunately, not yet... Deep learning support (autoencoders, RBMs) is
>> on the roadmap for 1.6
>>  though, and there is
>> a spark package
>>  for
>> dropout regularized logistic regression.
>>
>>
>> On Mon, Sep 7, 2015 at 3:15 PM, Ruslan Dautkhanov 
>> wrote:
>>
>>> Thanks!
>>>
>>> It does not look Spark ANN yet supports dropout/dropconnect or any other
>>> techniques that help avoiding overfitting?
>>> http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
>>> https://cs.nyu.edu/~wanli/dropc/dropc.pdf
>>>
>>> ps. There is a small copy-paste typo in
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/ann/BreezeUtil.scala#L43
>>> should read B :)
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Mon, Sep 7, 2015 at 12:47 PM, Feynman Liang 
>>> wrote:
>>>
 Backprop is used to compute the gradient here
 ,
 which is then optimized by SGD or LBFGS here
 

 On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> Haven't checked the actual code but that doc says "MLPC employes
> backpropagation for learning the model. .."?
>
>
>
> —
> Sent from Mailbox 
>
>
> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov <
> dautkha...@gmail.com> wrote:
>
>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>
>> Implementation seems missing backpropagation?
>> Was there is a good reason to omit BP?
>> What are the drawbacks of a pure feedforward-only ANN?
>>
>> Thanks!
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>
>

>>>
>>
>


Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
That parquet table wasn't created in Spark, is it?

There was a recent discussion on this list that complex data types in Spark
prior to 1.5 are often incompatible with Hive, for example, if I remember
correctly.

On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov  wrote:

> I am trying to read an (array typed) parquet file in spark-shell (Spark
> 1.4.1 with Hadoop 2.6):
>
> {code}
> $ bin/spark-shell
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
> server' on port 43731.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>   /_/
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
> 15/09/07 13:45:27 INFO Remoting: Starting remoting
> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@10.10.30.52:46083]
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
> on port 46083.
> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
> 265.1 MB
> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
> server' on port 38717.
> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
> localhost
> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
> http://10.10.30.52:43731
> 15/09/07 13:45:27 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
> localhost, 60973)
> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
> Spark context available as sc.
> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
> 0.13.1
> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
> 15/09/07 13:45:29 INFO Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
> unknown - will be ignored
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes
> with
> 

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
The same error if I do:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = sqlContext.sql("SELECT * FROM stats")

but it does work from Hive shell directly...

On Mon, Sep 7, 2015 at 1:56 PM, Alex Kozlov  wrote:

> I am trying to read an (array typed) parquet file in spark-shell (Spark
> 1.4.1 with Hadoop 2.6):
>
> {code}
> $ bin/spark-shell
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
> server' on port 43731.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>   /_/
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
> 15/09/07 13:45:27 INFO Remoting: Starting remoting
> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@10.10.30.52:46083]
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
> on port 46083.
> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
> 265.1 MB
> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
> server' on port 38717.
> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
> localhost
> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
> http://10.10.30.52:43731
> 15/09/07 13:45:27 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
> localhost, 60973)
> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
> Spark context available as sc.
> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
> 0.13.1
> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
> 15/09/07 13:45:29 INFO Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
> unknown - will be ignored
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes
> with
> 

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-07 Thread Reynold Xin
On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg  wrote:

>
> BTW, is it possible (or will it be) to use Tungsten with dynamic
> allocation and the external shuffle manager?
>
>
Yes - I think this already works. There isn't anything specific here
related to Tungsten.
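
For reference, the usual way to turn both on together on YARN; these are
standard config keys, the min/max values and class/jar names are just examples,
and the YARN NodeManagers additionally need the spark_shuffle auxiliary service
configured as described in the Spark docs:

spark-submit --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --class com.example.MyJob my-job.jar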


Spark summit Asia

2015-09-07 Thread Kevin Jung
Is there any plan to hold Spark summit in Asia?
I'm very much looking forward to it.

Kevin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-summit-Asia-tp24598.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Support of other languages?

2015-09-07 Thread Rahul Palamuttam
Hi, 
I wanted to know more about how Spark supports R and Python, with respect to
what gets copied into the language environments.

To clarify :

I know that PySpark utilizes py4j sockets to pass pickled python functions
between the JVM and the python daemons. However, I wanted to know how it
passes the data from the JVM into the daemon environment. I assume it has to
copy the data over into the new environment, since Python can't exactly
operate in JVM heap space (or can it?).

I had the same question with respect to SparkR, though I'm not completely
familiar with how they pass around native R code through the worker JVMs.

The primary question I wanted to ask is: does Spark make a second copy of the
data so that language-specific daemons can operate on it? What are some of the
other limitations encountered when trying to offer multi-language support,
whether in performance or in general software architecture? With Python in
particular, the result of a collect operation must first be written to disk
and then read back by the Python driver process.

Would appreciate any insight on this, and if there is any work happening in
this area.

Thank you,

Rahul Palamuttam  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Support-of-other-languages-tp24599.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Gheorghe Postelnicu
Hi,

The following code fails when compiled from SBT:

package main.scala

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object TestMain {
  def main(args: Array[String]): Unit = {
implicit val sparkContext = new SparkContext()
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
sparkContext.parallelize(1 to 10).map(i => (i,
i.toString)).toDF("intCol", "strCol")
  }
}

with the following error:

15/09/07 21:39:21 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
at main.scala.Bof$.main(Bof.scala:14)
at main.scala.Bof.main(Bof.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/09/07 21:39:22 INFO SparkContext: Invoking stop() from shutdown hook

whereas the code above works in a spark shell.

The code is compiled using Scala 2.11.6 and precompiled Spark 1.4.1

Any suggestion on how to fix this would be much appreciated.

Best,
Gheorghe


Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Jonathan Coveney
Try adding the following to your build.sbt

libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6"


I believe that Spark shades the Scala library, and it looks like this is a
library that you need in an unshaded way.
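
A related thing worth ruling out (hedged, since it depends on which download is
on the cluster): the stock precompiled Spark 1.4.1 binaries are built against
Scala 2.10, so an application compiled with Scala 2.11.6 and launched through
that spark-submit can hit exactly this kind of scala.reflect NoSuchMethodError.
A build.sbt sketch that keeps the versions aligned with a 2.10 distribution:

name := "testMain"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided"
)

(Alternatively, keep 2.11.6 and run against a Spark build compiled for Scala
2.11.)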


2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <
gheorghe.posteln...@gmail.com>:

> Hi,
>
> The following code fails when compiled from SBT:
>
> package main.scala
>
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
>
> object TestMain {
>   def main(args: Array[String]): Unit = {
> implicit val sparkContext = new SparkContext()
> val sqlContext = new SQLContext(sparkContext)
> import sqlContext.implicits._
> sparkContext.parallelize(1 to 10).map(i => (i,
> i.toString)).toDF("intCol", "strCol")
>   }
> }
>
> with the following error:
>
> 15/09/07 21:39:21 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
> at main.scala.Bof$.main(Bof.scala:14)
> at main.scala.Bof.main(Bof.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 15/09/07 21:39:22 INFO SparkContext: Invoking stop() from shutdown hook
>
> whereas the code above works in a spark shell.
>
> The code is compiled using Scala 2.11.6 and precompiled Spark 1.4.1
>
> Any suggestion on how to fix this would be much appreciated.
>
> Best,
> Gheorghe
>
>


Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
I am trying to read an (array typed) parquet file in spark-shell (Spark
1.4.1 with Hadoop 2.6):

{code}
$ bin/spark-shell
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hivedata);
users with modify permissions: Set(hivedata)
15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
server' on port 43731.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hivedata);
users with modify permissions: Set(hivedata)
15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
15/09/07 13:45:27 INFO Remoting: Starting remoting
15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.10.30.52:46083]
15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' on
port 46083.
15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
/tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity 265.1
MB
15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
server' on port 38717.
15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
port 4041.
15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
localhost
15/09/07 13:45:27 INFO Executor: Using REPL class URI:
http://10.10.30.52:43731
15/09/07 13:45:27 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
localhost, 60973)
15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
15/09/07 13:45:28 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
0.13.1
15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
15/09/07 13:45:29 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/09/07 13:45:36 INFO MetaStoreDirectSql: MySQL check failed, assuming we
are not on mysql: Lexical error at line 1, column 5.  Encountered: "@"
(64), after : "".
15/09/07 13:45:37 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
15/09/07 13:45:37 INFO Datastore: The class
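
A minimal sketch (hypothetical file path and column name, not taken from this
thread) of the kind of array-typed Parquet read being attempted, for the 1.4
shell where sqlContext is provided automatically:

{code}
// hypothetical path; sqlContext comes from the spark-shell REPL
val df = sqlContext.read.parquet("/tmp/array_data.parquet")
df.printSchema()  // an array column should appear as array<...> in the schema
df.show(5)
{code}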

Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Jonathan Coveney
How are you building and running it?

On Monday, September 7, 2015, Gheorghe Postelnicu <
gheorghe.posteln...@gmail.com> wrote:

> Interesting idea. Tried that, didn't work. Here is my new SBT file:
>
> name := """testMain"""
>
> scalaVersion := "2.11.6"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
>   "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided",
>   "org.scala-lang" % "scala-reflect" % "2.11.6"
> )
>
>
> On Mon, Sep 7, 2015 at 9:55 PM, Jonathan Coveney  > wrote:
>
>> Try adding the following to your build.sbt
>>
>> libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6"
>>
>>
>> I believe that Spark shades the Scala library, and it looks like you need
>> this library in unshaded form.
>>
>>
>> 2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <
>> gheorghe.posteln...@gmail.com
>> >:
>>
>>> Hi,
>>>
>>> The following code fails when compiled from SBT:
>>>
>>> package main.scala
>>>
>>> import org.apache.spark.SparkContext
>>> import org.apache.spark.sql.SQLContext
>>>
>>> object TestMain {
>>>   def main(args: Array[String]): Unit = {
>>> implicit val sparkContext = new SparkContext()
>>> val sqlContext = new SQLContext(sparkContext)
>>> import sqlContext.implicits._
>>> sparkContext.parallelize(1 to 10).map(i => (i,
>>> i.toString)).toDF("intCol", "strCol")
>>>   }
>>> }
>>>
>>> with the following error:
>>>
>>> 15/09/07 21:39:21 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
>>> at main.scala.Bof$.main(Bof.scala:14)
>>> at main.scala.Bof.main(Bof.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> 15/09/07 21:39:22 INFO SparkContext: Invoking stop() from shutdown hook
>>>
>>> whereas the code above works in a spark shell.
>>>
>>> The code is compiled using Scala 2.11.6 and precompiled Spark 1.4.1
>>>
>>> Any suggestion on how to fix this would be much appreciated.
>>>
>>> Best,
>>> Gheorghe
>>>
>>>
>>
>


Re: Spark 1.4 RDD to DF fails with toDF()

2015-09-07 Thread Gheorghe Postelnicu
sbt assembly; $SPARK_HOME/bin/spark-submit --class main.scala.TestMain
--master "local[4]" target/scala-2.11/bof-assembly-0.1-SNAPSHOT.jar

using Spark:

/opt/spark-1.4.1-bin-hadoop2.6

On Mon, Sep 7, 2015 at 10:20 PM, Jonathan Coveney 
wrote:

> How are you building and running it?
>
>
> On Monday, September 7, 2015, Gheorghe Postelnicu <
> gheorghe.posteln...@gmail.com> wrote:
>
>> Interesting idea. Tried that, didn't work. Here is my new SBT file:
>>
>> name := """testMain"""
>>
>> scalaVersion := "2.11.6"
>>
>> libraryDependencies ++= Seq(
>>   "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
>>   "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided",
>>   "org.scala-lang" % "scala-reflect" % "2.11.6"
>> )
>>
>>
>> On Mon, Sep 7, 2015 at 9:55 PM, Jonathan Coveney 
>> wrote:
>>
>>> Try adding the following to your build.sbt
>>>
>>> libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.6"
>>>
>>>
>>> I believe that Spark shades the Scala library, and it looks like you need
>>> this library in unshaded form.
>>>
>>>
>>> 2015-09-07 16:48 GMT-04:00 Gheorghe Postelnicu <
>>> gheorghe.posteln...@gmail.com>:
>>>
 Hi,

 The following code fails when compiled from SBT:

 package main.scala

 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SQLContext

 object TestMain {
   def main(args: Array[String]): Unit = {
 implicit val sparkContext = new SparkContext()
 val sqlContext = new SQLContext(sparkContext)
 import sqlContext.implicits._
 sparkContext.parallelize(1 to 10).map(i => (i,
 i.toString)).toDF("intCol", "strCol")
   }
 }

 with the following error:

 15/09/07 21:39:21 INFO BlockManagerMaster: Registered BlockManager
 Exception in thread "main" java.lang.NoSuchMethodError:
 scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
 at main.scala.Bof$.main(Bof.scala:14)
 at main.scala.Bof.main(Bof.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 15/09/07 21:39:22 INFO SparkContext: Invoking stop() from shutdown hook

 whereas the code above works in a spark shell.

 The code is compiled using Scala 2.11.6 and precompiled Spark 1.4.1

 Any suggestion on how to fix this would be much appreciated.

 Best,
 Gheorghe


>>>
>>
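
A likely culprit for the NoSuchMethodError above is a Scala binary mismatch:
the prebuilt spark-1.4.1-bin-hadoop2.6 distribution is compiled against Scala
2.10, while the application and its scala-reflect dependency are built with
2.11.6. A minimal build.sbt sketch, assuming that mismatch is indeed the
cause, would pin the application to the Scala line of the prebuilt binary:

{code}
// build.sbt -- sketch assuming the prebuilt Spark 1.4.1 binary targets Scala 2.10
name := """testMain"""

scalaVersion := "2.10.4"  // match the Scala version of the Spark distribution

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided"
)
{code}

Alternatively, keeping Scala 2.11 would require a Spark build made for 2.11
rather than the stock binary.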


Adding additional jars to distributed cache (yarn-client)

2015-09-07 Thread Srikanth Sundarrajan
Hi,
I am trying to use JavaSparkContext() to create a new SparkContext and
attempted to pass the requisite jars, but it looks like they aren't getting
added to the distributed cache automatically. Looking into
YarnClientSchedulerBackend::start() and ClientArguments, it did seem like they
would only add the SPARK_JAR and APP_JAR. I am wondering what is the best way
to add additional files to the distributed cache and also have them appear in
the classpath for ExecutorLauncher.

Thanks
Srikanth Sundarrajan
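
A minimal sketch (hypothetical paths and application name) of passing extra
jars programmatically when constructing the context; note this ships the jars
to executors, while getting them onto the ExecutorLauncher classpath via the
distributed cache is the open question here:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

val conf = new SparkConf()
  .setAppName("with-extra-jars")        // hypothetical application name
  .setMaster("yarn-client")
  // hypothetical local jar paths; these are served to executors by the driver
  .setJars(Seq("/path/to/dep1.jar", "/path/to/dep2.jar"))
val jsc = new JavaSparkContext(conf)
{code}

Passing the same jars with --jars to spark-submit is another commonly used
route.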
  

Re: Spark ANN

2015-09-07 Thread Nick Pentreath
Haven't checked the actual code but that doc says "MLPC employs
backpropagation for learning the model ..."?

—
Sent from Mailbox

On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov 
wrote:

> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
> The implementation seems to be missing backpropagation?
> Was there a good reason to omit BP?
> What are the drawbacks of a pure feedforward-only ANN?
> Thanks!
> -- 
> Ruslan Dautkhanov

Re: Spark ANN

2015-09-07 Thread Feynman Liang
Backprop is used to compute the gradient here, which is then optimized by SGD
or LBFGS here.


On Mon, Sep 7, 2015 at 11:24 AM, Nick Pentreath 
wrote:

> Haven't checked the actual code but that doc says "MLPC employs
> backpropagation for learning the model ..."?
>
>
>
> —
> Sent from Mailbox 
>
>
> On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov 
> wrote:
>
>> http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html
>>
>> The implementation seems to be missing backpropagation?
>> Was there a good reason to omit BP?
>> What are the drawbacks of a pure feedforward-only ANN?
>>
>> Thanks!
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>
>
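
For context, a minimal sketch of the ml API in question (assuming Spark 1.5
and a DataFrame named train with "label" and "features" columns; the layer
sizes are illustrative only):

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// 4 inputs, two hidden layers, 3 output classes -- illustrative sizes only
val layers = Array[Int](4, 5, 4, 3)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// gradients are computed by backpropagation and optimized by LBFGS (or SGD),
// as described above
val model = trainer.fit(train)
{code}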


Re: Spark on Yarn vs Standalone

2015-09-07 Thread Alexander Pivovarov
Hi Sandy

Thank you for your reply
Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
with the EMR setting for Spark "maximizeResourceAllocation": "true"

It is automatically converted to the Spark settings
spark.executor.memory              47924M
spark.yarn.executor.memoryOverhead 5324

We also set spark.default.parallelism = slave_count * 16

Does this look good to you? (we run a single heavy job on the cluster)

Alex

On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza  wrote:

> Hi Alex,
>
> If they're both configured correctly, there's no reason that Spark
> Standalone should provide performance or memory improvement over Spark on
> YARN.
>
> -Sandy
>
> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov 
> wrote:
>
>> Hi Everyone
>>
>> We are trying the latest aws emr-4.0.0 and Spark and my question is about
>> YARN vs Standalone mode.
>> Our usecase is
>> - start 100-150 nodes cluster every week,
>> - run one heavy spark job (5-6 hours)
>> - save data to s3
>> - stop cluster
>>
>> Officially, aws emr-4.0.0 comes with Spark on YARN.
>> It's probably possible to hack EMR by creating a bootstrap script which
>> stops YARN and starts a master and slaves on each computer (to start Spark
>> in standalone mode).
>>
>> My questions are
>> - Does Spark standalone provide a significant performance / memory
>> improvement in comparison to YARN mode?
>> - Is it worth hacking the official EMR Spark on YARN setup and switching
>> Spark to standalone mode?
>>
>>
>> I already created a comparison table and want you to check if my
>> understanding is correct.
>>
>> Let's say an r3.2xlarge computer has 52GB RAM available for Spark executor
>> JVMs.
>>
>> standalone to YARN comparison
>>
>>                                                         | STDLN  | YARN
>> can executor allocate up to 52GB RAM                    | yes    | yes
>> will executor be unresponsive after using all 52GB RAM
>> because of GC                                           | yes    | yes
>> additional JVMs on slave besides the Spark executor     | worker | node manager
>> are additional JVMs lightweight                         | yes    | yes
>>
>>
>> Thank you
>>
>> Alex
>>
>
>
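
For reference, a sketch of how the settings quoted in this thread could be set
explicitly (values are the ones reported above, not recommendations; slaveCount
and the application name are hypothetical):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val slaveCount = 100  // hypothetical; the use case mentions 100-150 nodes
val conf = new SparkConf()
  .setAppName("weekly-heavy-job")                      // hypothetical name
  .set("spark.executor.memory", "47924m")              // value reported above
  .set("spark.yarn.executor.memoryOverhead", "5324")   // value reported above
  .set("spark.default.parallelism", (slaveCount * 16).toString)
val sc = new SparkContext(conf)
{code}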