apache-spark doesn't work correctly with russian alphabet

2017-01-18 Thread AlexModestov
I want to use Apache Spark to work with text data. There are some Russian
symbols, but Apache Spark shows me strings that look like
"...\u0413\u041e\u0420\u041e...". What should I do to display them correctly?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/apache-spark-doesn-t-work-correktly-with-russian-alphabet-tp28316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



spark uses /tmp directory instead of directory from spark.local.dir

2016-12-15 Thread AlexModestov
Hello!
I want to use another directory instead of the /tmp directory for all temporary data...
I set spark.local.dir and -Djava.io.tmpdir=/... but I see that Spark still uses
/tmp for some data...
What is Spark doing? And what should I do so that Spark uses only my directories?
Thank you!
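A minimal sketch of moving the scratch space, assuming local mode and a placeholder path
/data/spark-tmp. Note that on a cluster spark.local.dir is overridden by SPARK_LOCAL_DIRS
(standalone/Mesos) or LOCAL_DIRS (YARN), and that the driver's own -Djava.io.tmpdir only
takes effect if it is passed when the driver JVM starts (e.g. spark.driver.extraJavaOptions
in spark-defaults.conf, or --driver-java-options), which is the usual reason some files
still land in /tmp:

from pyspark import SparkConf, SparkContext

# /data/spark-tmp is a placeholder; both settings must be in place before the context exists.
conf = (SparkConf()
        .set("spark.local.dir", "/data/spark-tmp")                # shuffle and spill files
        .set("spark.executor.extraJavaOptions",
             "-Djava.io.tmpdir=/data/spark-tmp"))                 # executor JVM temp files
sc = SparkContext(conf=conf)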



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-use-tmp-directory-instead-of-directory-from-spark-local-dir-tp28217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



work with russian letters

2016-08-24 Thread AlexModestov
  Hello everybody,

  I want to work with DataFrames where some columns have a string type and
contain Russian letters.
  The Russian letters come out incorrectly in the text. Could you help me with
how I should work with them?
  
  Thanks.
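If the letters come out garbled rather than escaped, the usual culprit is the file
encoding: sc.textFile and the DataFrame readers assume UTF-8, so a file saved in, say,
cp1251 gets mangled on read. A minimal sketch that decodes explicitly; the file name and
the cp1251 guess are both placeholders:

# Placeholders: "data.txt" and cp1251 stand in for the real file and its real encoding.
raw = sc.binaryFiles("data.txt")                                   # RDD of (path, raw bytes), whole files
lines = raw.flatMap(lambda kv: kv[1].decode("cp1251").splitlines())
df = sqlContext.createDataFrame(lines.map(lambda l: (l,)), ["text"])
df.first()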



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/work-with-russian-letters-tp27594.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



GC overhead limit exceeded

2016-05-16 Thread AlexModestov
I get this error in Apache Spark...

"spark.driver.memory 60g
spark.python.worker.memory 60g
spark.master local[*]"

The amount of data is about 5 GB, but Spark says "GC overhead limit
exceeded". I think my conf-file gives it enough resources.

"16/05/16 15:13:02 WARN NettyRpcEndpointRef: Error sending message [message
= Heartbeat(driver,[Lscala.Tuple2;@87576f9,BlockManagerId(driver, localhost,
59407))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10
seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
at
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
at
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
at
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[10 seconds]
at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
16/05/16 15:13:02 WARN NettyRpcEnv: Ignored message:
HeartbeatResponse(false)
05-16 15:13:26.398 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.74 GB + FREE:11.03 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:13:44.528 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.86 GB + FREE:10.90 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:13:56.847 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.88 GB + FREE:10.88 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:14:10.215 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.90 GB + FREE:10.86 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:14:33.622 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.91 GB + FREE:10.85 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:14:47.075 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:15:10.555 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.92 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:15:25.520 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
05-16 15:15:39.087 127.0.0.1:54321   2059   #e Thread WARN: Swapping! 
GC CALLBACK, (K/V:29.74 GB + POJO:16.93 GB + FREE:10.84 GB == MEM_MAX:57.50
GB), desiredKV=7.19 GB OOM!
Exception in thread "HashSessionScavenger-0" java.lang.OutOfMemoryError: GC
overhead limit exceeded
at
java.util.concurrent.ConcurrentHashMap$ValuesView.iterator(ConcurrentHashMap.java:4683)
at
org.eclipse.jetty.server.session.HashSessionManager.scavenge(HashSessionManager.java:314)
at

Re: ML regression - spark context dies without error

2016-05-12 Thread AlexModestov
Hello,
I have the same problem... Sometimes I get the error "Py4JError: Answer
from Java side is empty".
Sometimes my code works fine, but sometimes it does not...
Did you find out what causes it? What was the reason?
Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ML-regression-spark-context-dies-without-error-tp22633p26938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Need for advice - performance improvement and out of memory resolution

2016-05-12 Thread AlexModestov
Hello.
I'm sorry, but did you find the answer?
I have a similar error and I cannot solve it... No one has answered me...
The Spark driver dies and I get the error "Answer from Java side is empty".
I thought it happened because I made a mistake in this conf-file.

I use Sparkling Water 1.6.3 and Spark 1.6.
I use Oracle Java 8 or OpenJDK 7.
(Every time I get this error when I transform a Spark DataFrame into an H2O
DataFrame.)

ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
    raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
    self.socket.connect((self.address, self.port))
  File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):

My conf-file:
spark.serializer org.apache.spark.serializer.KryoSerializer 
spark.kryoserializer.buffer.max 1500mb
spark.driver.memory 65g
spark.driver.extraJavaOptions -XX:-PrintGCDetails -XX:PermSize=35480m
-XX:-PrintGCTimeStamps -XX:-PrintTenuringDistribution  
spark.python.worker.memory 65g
spark.local.dir /data/spark-tmp
spark.ext.h2o.client.log.dir /data/h2o
spark.logConf false
spark.master local[*]
spark.driver.maxResultSize 0
spark.eventLog.enabled True
spark.eventLog.dir /data/spark_log

In the code I use "persist" data (amount of data is 5.7 GB).
I guess that there is enough memory.
Could anyone help me?
Thanks!
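Since local[*] mode puts the Spark driver, the executors and the H2O node in one JVM, the
~5.7 GB persisted dataset competes with H2O's own K/V store for the same 65 GB heap. One
low-risk thing to try is letting the cached data spill to disk instead of pinning it in
memory; a minimal sketch, with df standing in for the DataFrame that is persisted:

from pyspark import StorageLevel

# df is a placeholder for the DataFrame currently persisted in memory.
df = df.persist(StorageLevel.MEMORY_AND_DISK)   # allow partitions to spill to disk
df.count()                                      # materialize the cache once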




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-for-advice-performance-improvement-and-out-of-memory-resolution-tp24886p26937.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error: "Answer from Java side is empty"

2016-05-11 Thread AlexModestov
I use Sparkling Water 1.6.3 and Spark 1.6. I use Oracle Java 8 or OpenJDK 7.
(Every time I get this error when I transform a Spark DataFrame into an H2O
DataFrame, the Spark cluster dies.)

ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
    raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
    self.socket.connect((self.address, self.port))
  File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):

My conf-file:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1500mb
spark.driver.memory 65g
spark.driver.extraJavaOptions -XX:-PrintGCDetails -XX:PermSize=35480m -XX:-PrintGCTimeStamps -XX:-PrintTenuringDistribution
spark.python.worker.memory 65g
spark.local.dir /data/spark-tmp
spark.ext.h2o.client.log.dir /data/h2o
spark.logConf false
spark.master local[*]
spark.driver.maxResultSize 0
spark.eventLog.enabled True
spark.eventLog.dir /data/spark_log

In the code I "persist" the data (the amount of data is 5.7 GB). There is
nothing in the H2O log-files. I guess that there is enough memory.
Could anyone help me?
Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-Answer-from-Java-side-is-empty-tp26929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

SQL Driver

2016-04-19 Thread AlexModestov
Hello all,
I pass this string when I'm launching Sparkling Water:
"--conf
spark.driver.extraClassPath='/SQLDrivers/sqljdbc_4.2/enu/sqljdbc41.jar"
and I get the error:
"
---
TypeError Traceback (most recent call last)
 in ()
  1 from pysparkling import *
----> 2 hc = H2OContext(sc).start()

/tmp/modestov/spark/work/spark-5695a33c-905d-4af5-a719-88b7be0e0c45/userFiles-77e075c2-41cc-44d6-96fb-a2668b112133/pySparkling-1.6.1-py2.7.egg/pysparkling/context.py
in __init__(self, sparkContext)
 70 def __init__(self, sparkContext):
 71 try:
---> 72 self._do_init(sparkContext)
 73 # Hack H2OFrame from h2o package
 74 _monkey_patch_H2OFrame(self)

/tmp/modestov/spark/work/spark-5695a33c-905d-4af5-a719-88b7be0e0c45/userFiles-77e075c2-41cc-44d6-96fb-a2668b112133/pySparkling-1.6.1-py2.7.egg/pysparkling/context.py
in _do_init(self, sparkContext)
 94 gw = self._gw
 95 
---> 96 self._jhc =
jvm.org.apache.spark.h2o.H2OContext.getOrCreate(sc._jsc)
 97 self._client_ip = None
 98 self._client_port = None

TypeError: 'JavaPackage' object is not callable"
What does it mean?
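"'JavaPackage' object is not callable" is py4j's way of saying it could not find the class
org.apache.spark.h2o.H2OContext on the driver's JVM classpath (unknown names fall back to
package objects), so the message points at the Sparkling Water assembly jar not being
visible rather than at the SQL driver jar. A small hedged check from the Python side:

# _jvm is an internal handle, but it is handy for a quick check:
#   JavaClass   -> org.apache.spark.h2o.H2OContext is on the driver classpath
#   JavaPackage -> it is not, and calling it gives exactly this TypeError
print(type(sc._jvm.org.apache.spark.h2o.H2OContext))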



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Driver-tp26800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-13 Thread AlexModestov
I get this error.
Does anyone know what it means?

Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1
locations. Most recent failure cause:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1397)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1384)
at
org.apache.spark.sql.execution.TakeOrderedAndProject.collectData(basicOperators.scala:213)
at
org.apache.spark.sql.execution.TakeOrderedAndProject.doExecute(basicOperators.scala:223)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
at
org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.Union.doExecute(basicOperators.scala:144)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
at
org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
at
org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
at
org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at

Re: spark.driver.extraClassPath and export SPARK_CLASSPATH

2016-04-13 Thread AlexModestov
I wrote in "spark-defaults.conf" spark.driver.extraClassPath '/dir'
or "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook"
/.../sparkling-water-1.6.1/bin/pysparkling \ --conf
spark.driver.extraClassPath='/.../sqljdbc41.jar'
Nothing works



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraClassPath-and-export-SPARK-CLASSPATH-tp26740p26774.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe

2016-04-12 Thread AlexModestov
I get an error while forming a DataFrame from a Parquet file:

Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result:
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1
locations. Most recent failure cause:



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/An-error-occurred-while-calling-z-org-apache-spark-sql-execution-EvaluatePython-takeAndServe-tp26764.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark.driver.extraClassPath and export SPARK_CLASSPATH

2016-04-11 Thread AlexModestov
Hello, I've started to use Spark 1.6.1 (before that I used Spark 1.5).
I included the line export
SPARK_CLASSPATH="/SQLDrivers/sqljdbc_4.2/enu/sqljdbc41.jar" when I launched
pysparkling and it worked well.
But in version 1.6.1 there is an error saying that this is deprecated and that I
have to use spark.driver.extraClassPath instead.
OK, there is now the line spark.driver.extraClassPath
/SQLDrivers/sqljdbc_4.2/enu/sqljdbc41.jar in spark-defaults.conf, but Spark
says that there is no suitable driver for working with SQL Server.
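The "no suitable driver" message comes from JDBC's DriverManager; independently of how the
jar reaches the classpath, naming the driver class explicitly in the read options usually
gets around it. A minimal PySpark sketch with placeholder host, database, table and
credentials:

# All connection details below are placeholders.
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")  # class inside sqljdbc41.jar
      .load())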




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-extraClassPath-and-export-SPARK-CLASSPATH-tp26740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark demands HiveContext but I use only SqlContext

2016-04-11 Thread AlexModestov
Hello!
I work with SQLContext: I run a query against MS SQL Server and get the data...
Spark tells me that I have to install Hive...
I have started to use Spark 1.6.1 (before that I used Spark 1.5 and never ran
into this requirement)...


Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to
instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
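In the 1.6 pyspark shell the ready-made sqlContext is a HiveContext whenever Spark was
built with Hive support, which is where the metastore client suddenly comes from. If only
plain SQL over non-Hive sources is needed, constructing the basic SQLContext explicitly
sidesteps Hive; a minimal sketch:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)                  # plain SQL support, no Hive metastore involved
df = sqlContext.read.parquet("/some/path")   # placeholder path; any non-Hive source works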



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-demands-HiveContext-but-I-use-only-SqlContext-tp26738.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sql functions: row_number, percent_rank, rank, rowNumber

2016-03-10 Thread AlexModestov
Hello all,
I am trying to use some SQL functions.
My task is to number the rows of a DataFrame.
I use the SQL functions but they don't work, and I don't understand why.
I would appreciate your help in fixing this issue.
Thank you!
The piece of my code:

"from pyspark.sql.functions import row_number, percent_rank, rank,
randn,rowNumber
res_sorted.select(rowNumber()).head(10)"

res_sorted is a sorted DataFrame.
The error is:

"AnalysisException: u"unresolved operator 'Project ['row_number() AS
'row_number()#2848];""



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sql-functions-row-number-percent-rank-rank-rowNumber-tp26448.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark.driver.maxResultSize doesn't work in conf-file

2016-02-20 Thread AlexModestov
I have the line spark.driver.maxResultSize=0 in spark-defaults.conf.
But I get this error:

"org.apache.spark.SparkException: Job aborted due to stage failure: Total
size of serialized results of 18 tasks (1070.5 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)"

But if I pass --conf spark.driver.maxResultSize=0 to the pyspark shell, it works
fine.

Does anyone know how to fix this?
Thank you
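Only $SPARK_HOME/conf/spark-defaults.conf (or the file named with --properties-file) is
read at launch, so if a different copy is being edited the line silently has no effect. A
hedged fallback is to set the property programmatically before the SparkContext is
created, since it cannot be changed on a running context:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.driver.maxResultSize", "0")    # 0 means unlimited
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.driver.maxResultSize"))        # quick sanity check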



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-maxResultSize-doesn-t-work-in-conf-file-tp26279.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



an error when I read data from parquet

2016-02-19 Thread AlexModestov
Hello everybody,

I use the Python API and the Scala API. I read the data without problems with the Python API:

"sqlContext = SQLContext(sc)
data_full = sqlContext.read.parquet("---")"

But when I use Scala:

"val sqlContext = new SQLContext(sc)
val data_full = sqlContext.read.parquet("---")"

I get this error (I use Spark-Notebook, maybe that is important):
"java.lang.ExceptionInInitializerError
at sun.misc.Unsafe.ensureClassInitialized(Native Method)
at
sun.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(UnsafeFieldAccessorFactory.java:43)
at
sun.reflect.ReflectionFactory.newFieldAccessor(ReflectionFactory.java:140)
at java.lang.reflect.Field.acquireFieldAccessor(Field.java:1057)
at java.lang.reflect.Field.getFieldAccessor(Field.java:1038)
at java.lang.reflect.Field.get(Field.java:379)
at notebook.kernel.Repl.getModule$1(Repl.scala:203)
at notebook.kernel.Repl.iws$1(Repl.scala:212)
at notebook.kernel.Repl.liftedTree1$1(Repl.scala:219)
at notebook.kernel.Repl.evaluate(Repl.scala:199)
at
notebook.client.ReplCalculator$$anonfun$15$$anon$1$$anonfun$29.apply(ReplCalculator.scala:378)
at
notebook.client.ReplCalculator$$anonfun$15$$anon$1$$anonfun$29.apply(ReplCalculator.scala:375)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NoSuchMethodException:
org.apache.spark.io.SnappyCompressionCodec.<init>(org.apache.spark.SparkConf)
at java.lang.Class.getConstructor0(Class.java:2892)
at java.lang.Class.getConstructor(Class.java:1723)
at
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:71)
at
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
at
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
at
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
at
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:108)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
at
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
at
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
at
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.toJSON(DataFrame.scala:1724)
at
notebook.front.widgets.DataFrameView$class.notebook$front$widgets$DataFrameView$$json(DataFrame.scala:40)
at
notebook.front.widgets.DataFrameWidget.notebook$front$widgets$DataFrameView$$json$lzycompute(DataFrame.scala:64)
at
notebook.front.widgets.DataFrameWidget.notebook$front$widgets$DataFrameView$$json(DataFrame.scala:64)
at
notebook.front.widgets.DataFrameView$class.$init$(DataFrame.scala:41)
at notebook.front.widgets.DataFrameWidget.<init>(DataFrame.scala:69)
at
notebook.front.ExtraLowPriorityRenderers$dataFrameAsTable$.render(renderer.scala:13)
at
notebook.front.ExtraLowPriorityRenderers$dataFrameAsTable$.render(renderer.scala:12)
at notebook.front.Widget$.fromRenderer(Widget.scala:32)
at
$line19.$rendered$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$.(:92)

Scala from Jupyter

2016-02-16 Thread AlexModestov
Hello!
I want to use Scala from Jupyter (or maybe something else if you could
recommend anything; I mean an IDE). Does anyone know how I can do this?
Thank you!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scala-from-Jupyter-tp26234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org