"object cannot be cast to Double" using pipline with pyspark

2016-08-03 Thread colin
My colleagues use Scala and I use Python. They save a Hive table which has DoubleType columns. However, there is no double in Python. When I use pipeline.fit(dataframe), an error occurred: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.lang.Double. I guess

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Mich Talebzadeh
Just to clarify: are you building Spark 2 from the downloaded source? Or are you referring to building an uber jar file from your own code with mvn, to submit with spark-submit etc.? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: How to get recommendation results for users in a Kafka SparkStreaming Application

2016-08-03 Thread luohui20001
PS: I am using Spark 1.6.1, Kafka 0.10.0.0. Thanks & Best regards! San.Luo - Original Message - From: To: "user" Subject: How to get recommendation results for users in a Kafka SparkStreaming Application Date: 2016-08-03 15:01 hello

spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread lev
Hi, in Spark 1.5, to build an uber-jar, I would just compile the code with: mvn ... package, and that would create one big jar with all the dependencies. When trying to do the same with Spark 2.0, I'm getting a tar.gz file instead. This is the full command I'm using: mvn -Pyarn -Phive

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Saisai Shao
I guess you're referring to the Spark assembly uber jar. In Spark 2.0 there is no uber jar; instead there is a jars folder which contains all the jars required at run time. For the end user it is transparent; the way to submit a Spark application is still the same. On Wed, Aug 3, 2016 at 4:51 PM,

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Using the dominant resource calculator instead of the default resource calculator will get you the expected vcores. Basically, by default YARN does not honor CPU cores as a resource, so you will always see vcores as 1 no matter what number of cores you set in Spark. On Wed, Aug 3, 2016 at 12:11 PM,

How to get recommendation results for users in a Kafka SparkStreaming Application

2016-08-03 Thread luohui20001
Hello guys: I have an app which consumes JSON messages from Kafka and recommends movies for the users in those messages. The code is like this: conf.setAppName("KafkaStreaming") val storageLevel = StorageLevel.DISK_ONLY val ssc = new StreamingContext(conf,

Relative path in absolute URI

2016-08-03 Thread Abhishek Ranjan
Hi All, I am trying to use Spark 2.0 with hadoop-hdfs 2.7.2 and Scala 2.11. I am not able to use the API below to load a file from my local HDFS. spark.sqlContext.read .format("com.databricks.spark.csv") .option("header", "true") .option("inferSchema", "true") .load(path) .show()

Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?

2016-08-03 Thread dueckm
Hello, first of all - excuse me for sending this post more than once, but I am new to this mailing list and did not subscribe completely, so I suspect my previous postings will not be accepted ... I built a prototype that uses join and groupBy operations via Spark RDD API. Recently I migrated

converting a Dataset into JavaRDD

2016-08-03 Thread Carlo . Allocca
Hi All, I am trying to convert a Dataset into JavaRDD in order to apply a linear regression. I am using spark-core_2.10, version 2.0.0, with Java 1.8. My current approach is: == Step 1: convert the Dataset into JavaRDD JavaRDD dataPoints = modelDS.toJavaRDD(); == Step 2: convert JavaRDD

Re: Extracting key word from a textual column

2016-08-03 Thread Mich Talebzadeh
Guys, we are going off on a tangent here. The question was: what is the easiest way of finding key words in a string column such as transactiondescription? I am aware of functions like regexp, instr, patindex etc. How is this done in general, not necessarily in Spark? For example. A naive question.

Spark Streaming with Flume jobs failing

2016-08-03 Thread Bhupendra Mishra
Hi team, I have integrated Spark Streaming with Flume, and both my Flume and Spark jobs fail with the following error. Your help will be highly appreciated. Many thanks. My Flume configuration is as follows: flume.conf *** agent1.sources = source1 agent1.sinks = sink1 agent1.channels =

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Jacek Laskowski
Hi Jestin, I need to expand on this in the Spark notes. Spark can handle data locality itself, but if the Spark nodes run on separate nodes from the HDFS ones, there is always the network between them, which makes performance worse compared to co-locating the Spark and HDFS nodes. These are mere details

Spark 2.0 - Case sensitive column names while reading csv

2016-08-03 Thread Aseem Bansal
While reading CSV via DataFrameReader, how can I make column names case sensitive? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html None of the options specified mentions case sensitivity
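One possible knob, assuming the goal is case-sensitive column resolution in Spark SQL (a sketch, not from the thread; the CSV path is hypothetical):

    // spark.sql.caseSensitive controls whether column names are resolved case-sensitively.
    spark.conf.set("spark.sql.caseSensitive", "true")
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/input.csv")   // hypothetical path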

Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Hi, I have a Spark workflow that works fine when run on a relatively small portion of data, but fails with strange errors when run on big data. In the log files of failed executors I found the following errors: Firstly > Managed memory leak detected; size = 263403077 bytes, TID = 6524 And

[Thriftserver2] Controlling number of tasks

2016-08-03 Thread Yana Kadiyska
Hi folks, I have an ETL pipeline that drops a file every half hour. When Spark reads these files, I end up with 315K tasks for a dataframe reading a few days' worth of data. I know that with a regular Spark job I can use coalesce to come down to a lower number of tasks. Is there a way to tell

Re: How to partition a SparkDataFrame using all distinct column values in sparkR

2016-08-03 Thread Sun Rui
SparkDataFrame.repartition() uses hash partitioning; it can guarantee that all rows with the same column value go to the same partition, but it does not guarantee that each partition contains only a single column value. Fortunately, Spark 2.0 comes with gapply() in SparkR. You can apply an R

Re: Relative path in absolute URI

2016-08-03 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-15899 On Tue, Aug 2, 2016 at 11:54 PM, Abhishek Ranjan wrote: > Hi All, > > I am trying to use spark 2.0 with hadoop-hdfs 2.7.2 with scala 2.11. I am > not able to use below API to load file from my local hdfs. > >

Python : StreamingContext isn't created properly

2016-08-03 Thread Paolo Patierno
Hi all, I wrote the following function in Python: def createStreamingContext(): conf = (SparkConf().setMaster("local[2]").setAppName("my_app")) conf.set("spark.streaming.receiver.writeAheadLog.enable", "true") sc = SparkContext(conf=conf) ssc = StreamingContext(sc, 1)

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Spark 2.0 has been released. Mind giving it a try :-) ? On Wed, Aug 3, 2016 at 9:11 AM, Rychnovsky, Dusan < dusan.rychnov...@firma.seznam.cz> wrote: > OK, thank you. What do you suggest I do to get rid of the error? > > > -- > *From:* Ted Yu >

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
I am confused. I tried to look for a Spark version that would have this issue fixed, i.e. https://github.com/apache/spark/pull/13027/ merged in, but it looks like the patch has not been merged into 1.6. How do I get a fixed 1.6 version? Thanks, Dusan

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
OK, thank you. What do you suggest I do to get rid of the error? From: Ted Yu Sent: Wednesday, August 3, 2016 6:10 PM To: Rychnovsky, Dusan Cc: user@spark.apache.org Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to

Error in building spark core on windows - any suggestions please

2016-08-03 Thread Tony Lane
I am trying to build Spark on Windows, and I am getting the following test failures and consequent build failures. [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ spark-core_2.11 --- --- T E S T S

Re: Stop Spark Streaming Jobs

2016-08-03 Thread Tony Lane
SparkSession exposes stop() method On Wed, Aug 3, 2016 at 8:53 AM, Pradeep wrote: > Thanks Park. I am doing the same. Was trying to understand if there are > other ways. > > Thanks, > Pradeep > > > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee >
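For a streaming job specifically, a minimal sketch of a graceful shutdown, assuming an existing StreamingContext named ssc:

    // Let in-flight batches finish, then stop the underlying SparkContext as well.
    ssc.stop(stopSparkContext = true, stopGracefully = true)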

Re: [Thriftserver2] Controlling number of tasks

2016-08-03 Thread Chanh Le
I believe there is no way to reduce the number of tasks through Hive using coalesce, because when it comes to Hive it just reads the files, and the task count depends on the number of files you put in. So the way I did it was to coalesce at the ETL layer: write as few files as possible, which also reduces IO time when reading them. > On Aug 3,
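A rough sketch of that compaction step at the ETL layer (paths and the partition count are hypothetical):

    // Rewrite each half-hourly drop into a handful of larger files so that later
    // reads through the Thrift server spawn far fewer tasks.
    val raw = spark.read.parquet("/etl/raw/2016-08-03")
    raw.coalesce(8)
      .write
      .mode("overwrite")
      .parquet("/etl/compact/2016-08-03")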

Calling KmeansModel predict method

2016-08-03 Thread Rohit Chaddha
The predict method takes a Vector object. I am unable to figure out how to build this Spark vector object for getting predictions from my model. Does anyone have some code in Java for this? Thanks, Rohit
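A sketch in Scala, assuming an already-trained org.apache.spark.mllib.clustering.KMeansModel named model (the Java calls are analogous through the same Vectors factory methods):

    import org.apache.spark.mllib.linalg.Vectors

    // Dense and sparse ways to build the point to score; indices are 0-based.
    val densePoint  = Vectors.dense(1.0, 2.0, 3.0)
    val sparsePoint = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0))

    val cluster = model.predict(densePoint)   // returns the cluster index as an Int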

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
The latest QA run is no longer accessible (error 404): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull Looking at the comments on the PR, there is not enough confidence to pull the fix into 1.6 On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan <

SparkSession for RDBMS

2016-08-03 Thread Selvam Raman
Hi All, I would like to read data from an RDBMS into Spark (2.0) using SparkSession. How can I decide the upper boundary, lower boundary and partitions? Is there any specific approach available? How does Sqoop2 decide the number of partitions and the upper and lower boundaries if we are not specifying anything? --

Change nullable property in Dataset schema

2016-08-03 Thread Kazuaki Ishizaki
Dear all, Would it be possible to let me know how to change the nullable property in a Dataset? When I looked for how to change the nullable property in a DataFrame schema, I found the following approaches. http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
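One commonly cited workaround is to rebuild the schema with the desired nullability and re-create the DataFrame from the underlying rows; a sketch:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.{StructField, StructType}

    // Force every column to the given nullability; adjust the match if only some columns should change.
    def setNullable(df: DataFrame, nullable: Boolean): DataFrame = {
      val schema = StructType(df.schema.map {
        case StructField(name, dataType, _, metadata) =>
          StructField(name, dataType, nullable, metadata)
      })
      df.sparkSession.createDataFrame(df.rdd, schema)
    }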

How does MapWithStateRDD distribute the data

2016-08-03 Thread Soumitra Johri
Hi, I am running a streaming job with 4 executors and 16 cores so that each executor has two cores to work with. The input Kafka topic has 4 partitions. With this given configuration I was expecting MapWithStateRDD to be evenly distributed across all executors; however, I see that it uses only two
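If the default partitioning of the state RDD is the culprit, one thing to try (a sketch; the update function is a toy and keyedStream is assumed to be an existing DStream[(String, Int)]) is giving the StateSpec an explicit partition count:

    import org.apache.spark.streaming.{State, StateSpec}

    // Toy state update: keep a running count per key.
    def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    // numPartitions spreads the state over all cores instead of inheriting
    // the (possibly small) parent partitioning.
    val spec = StateSpec.function(updateCount _).numPartitions(8)
    val stateStream = keyedStream.mapWithState(spec)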

RE: Python : StreamingContext isn't created properly

2016-08-03 Thread Paolo Patierno
Sorry, I missed the "return" on the last line ... coming from Scala :-) Paolo Patierno Senior Software Engineer (IoT) @ Red Hat Microsoft MVP on Windows Embedded & IoT Microsoft Azure Advisor Twitter : @ppatierno Linkedin : paolopatierno Blog : DevExperience From: ppatie...@live.com To:

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mehdi Meziane
Hi Mich, The data is stored as parquet. The table definition looks like : CREATE EXTERNAL TABLE nadata ( extract_date TIMESTAMP, date_formatted STRING, day_of_week INT, hour_of_day INT, entity_label STRING, entity_currency_id INT, entity_currency_label STRING,

Re: Error in building spark core on windows - any suggestions please

2016-08-03 Thread Sean Owen
Hm, all of the Jenkins builds are OK, but none of them run on Windows. It could be a Windows-specific thing. It means the launcher process exited abnormally. I don't see any log output from the process, so maybe it failed entirely to launch. Anyone else on Windows seeing this fail or work? On

2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Kiran Chitturi
Hi, with the Spark 2.0.0 release, the 'spark-streaming-twitter' package and several other packages were not released/published to Maven Central. It looks like these packages have been removed from the official Spark repo. I found the replacement git repos for these missing packages at

Re: PermGen space Error

2016-08-03 Thread $iddhe$h Divekar
I am running the spark job in yarn-client mode. On Wed, Aug 3, 2016 at 8:14 PM, $iddhe$h Divekar wrote: > Hi, > > I am running spark jobs using apache oozie. > My job.properties has sparkConf which gets used in workflow.xml. > > I have tried increasing MaxPermSize

Re: how to debug spark app?

2016-08-03 Thread Sumit Khanna
I am not really sure of the best practices on this, but I either consult localhost:4040/jobs/ etc. or, better, this: val customSparkListener: CustomSparkListener = new CustomSparkListener() sc.addSparkListener(customSparkListener) class CustomSparkListener extends SparkListener { override def

Re: Spark 2.0: Task never completes

2016-08-03 Thread Utkarsh Sengar
Not sure what caused it, but the partition size was 3 million there. The RDD was created from mongo hadoop 1.5.1. Earlier (mongo hadoop 1.3 and spark 1.5) it worked just fine; not sure what changed. As a fix, I applied a repartition(40) (where 40 varies by my processing logic) before the cartesian

PermGen space Error

2016-08-03 Thread $iddhe$h Divekar
Hi, I am running spark jobs using apache oozie. My job.properties has sparkConf which gets used in workflow.xml. I have tried increasing MaxPermSize using sparkConf in job.properties but that is not resolving the issue. *sparkConf*=--verbose --driver-java-options '-XX:MaxPermSize=8192M' --conf
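One thing worth checking (a sketch, not from the thread; the size is illustrative): --driver-java-options only reaches the driver JVM, so in yarn-client mode the executors need the flag separately, e.g.:

    import org.apache.spark.SparkConf

    // Pass the PermGen flag to the executor JVMs too; the driver already gets it
    // through --driver-java-options in the existing sparkConf line.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512M")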

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Kiran Chitturi
Thank you! On Wed, Aug 3, 2016 at 8:45 PM, Marcelo Vanzin wrote: > The Flume connector is still available from Spark: > > http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-streaming-flume-assembly_2.11%7C2.0.0%7Cjar > > Many of the others have indeed been

Re: how to debug spark app?

2016-08-03 Thread Ted Yu
Have you looked at: https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application If you use Mesos: https://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging On Wed, Aug 3, 2016 at 6:13 PM, glen wrote: > Any tool like gdb?

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-08-03 Thread Sun Rui
--num-executors does not work for Standalone mode. Try --total-executor-cores > On Jul 26, 2016, at 00:17, Mich Talebzadeh wrote: > > Hi, > > > I am doing some tests > > I have started Spark in Standalone mode. > > For simplicity I am using one node only with 8

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try to turn yarn.scheduler.capacity.resource-calculator on, then check again. On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Use dominant resource calculator instead of default resource calculator will > get the expected vcores as you wanted. Basically by default

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try to turn "yarn.scheduler.capacity.resource-calculator" on On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Use dominant resource calculator instead of default resource calculator will > get the expected vcores as you wanted. Basically by default yarn does not >

how to debug spark app?

2016-08-03 Thread glen
Any tool like gdb, which supports breakpoints at some line or in some function?

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Marcelo Vanzin
The Flume connector is still available from Spark: http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-streaming-flume-assembly_2.11%7C2.0.0%7Cjar Many of the others have indeed been removed from Spark, and can be found at the Apache Bahir project: http://bahir.apache.org/ I

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Sean Owen
You're looking for http://bahir.apache.org/ On Wed, Aug 3, 2016 at 8:40 PM, Kiran Chitturi wrote: > Hi, > > When Spark 2.0.0 is released, the 'spark-streaming-twitter' package and > several other packages are not released/published to maven central. It looks > like

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread flavio marchi
Hi Sean, thanks for the reply. I just omitted the real path, which would point to my file system. I just posted the real one. On 3 Aug 2016 19:09, "Sean Owen" wrote: > file: "absolute directory" > does not sound like a valid URI > > On Wed, Aug 3, 2016 at 11:05 AM, Flavio

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
Hi Sean, I did not understand. I created a KMeansModel with 3 dimensions and then I am calling the predict method on this model with a 3-dimension vector. I am not sure what is wrong with this approach. Am I missing a point? Tony On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen wrote:

Re: SparkSession for RDBMS

2016-08-03 Thread Takeshi Yamamuro
Hi, If these boundaries are not given, Spark tries to read all the data as a single partition. See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L56 // maropu On Wed, Aug 3, 2016 at 11:19 PM, Selvam Raman
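A sketch of the bounded, partitioned JDBC read (the URL, table, column, bounds and credentials are all hypothetical); note the bounds only control how the partition ranges are split, rows outside them are still read:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")       // hypothetical credentials
    props.setProperty("password", "secret")

    val df = spark.read.jdbc(
      "jdbc:postgresql://dbhost:5432/sales",  // url
      "transactions",                         // table
      "id",                                   // numeric column used to split the reads
      1L,                                     // lowerBound
      10000000L,                              // upperBound
      8,                                      // numPartitions
      props)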

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Ted Yu
SPARK-15899 ? On Wed, Aug 3, 2016 at 11:05 AM, Flavio wrote: > Hello everyone, > > I am try to run a very easy example but unfortunately I am stuck on the > follow exception: > > Exception in thread "main"

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread flavio marchi
Thanks for the reply. Well, I don't have Hadoop installed at all. I am just running locally as an Eclipse project, so I don't know where I can configure the path as suggested in the JIRA :( On 3 Aug 2016 19:07, "Ted Yu" wrote: > SPARK-15899
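A sketch of the workaround usually suggested under SPARK-15899 for local runs (the directory is hypothetical): point spark.sql.warehouse.dir at an absolute file: URI when building the session.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-test")
      // An absolute URI avoids the "Relative path in absolute URI" parse error.
      .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
      .getOrCreate()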

Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I am using the following vector definition in Java: Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }). However, when I run the predict method on this vector it leads to: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3 at

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Sean Owen
file: "absolute directory" does not sound like a valid URI On Wed, Aug 3, 2016 at 11:05 AM, Flavio wrote: > Hello everyone, > > I am try to run a very easy example but unfortunately I am stuck on the > follow exception: > > Exception in thread "main"

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Flavio
Just for clarification, this is the full stack trace: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 16/08/03 18:18:44 INFO SparkContext: Running Spark version 2.0.0 16/08/03 18:18:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Gourav Sengupta
From what I am observing, your path is s3a://buckets3/day=2016-07-25 and the partition keys are day, mba_id and partition_id. Are the subfolders in the form s3://buckets3/day=2016-07-25/mba_id=1122/partition_id=111? Can you please include the add partition statement as well, for one single partition? The

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
One more: it seems that the steps == Step 1: transform the Dataset into JavaRDD JavaRDD dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD(); and List someRows = dataPointsWithHeader.collect(); someRows.forEach(System.out::println); do not print the header. So, could I

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I guess the setup of the model and the usage of the vector confused me. The setup takes positions 1, 2, 3 - like this in the build example - "1:0.0 2:0.0 3:0.0". I thought I needed to follow the same numbering while creating the vector too. Thanks a bunch On Thu, Aug 4, 2016 at 12:39 AM, Sean Owen

Re: unsubscribe

2016-08-03 Thread Daniel Lopes
please send to user-unsubscr...@spark.apache.org *Daniel Lopes* Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Tue, Aug 2, 2016 at 10:11 AM,

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
Yeah, that's libsvm format, which is 1-indexed. On Wed, Aug 3, 2016 at 12:45 PM, Tony Lane wrote: > I guess the setup of the model and usage of the vector got to me. > Setup takes position 1 , 2 , 3 - like this in the build example - "1:0.0 > 2:0.0 3:0.0" > I thought I

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
You mean "new int[] {0,1,2}" because vectors are 0-indexed. On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane wrote: > Hi Sean, > > I did not understand, > I created a KMeansModel with 3 dimensions and then I am calling predict > method on this model with a 3 dimension vector ?

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi Mich, Thanks again. My issue is not when I read the CSV from a file. It is when you have a Dataset that is the output of some join operations. Any help on that? Many Thanks, Best, Carlo On 3 Aug 2016, at 21:43, Mich Talebzadeh >

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
Hm, odd. Otherwise you can try using the Databricks package to read the CSV file. This is a Scala example: //val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868") val df =

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
OK, in other words the result set of joining two datasets ends up inconsistent, as a header from one DS is joined with another row from another DS? You really need to get rid of the headers one way or another before joining. Or try to register them as temp tables before the join to see where the
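A small sketch of dropping a header row that leaked into a Dataset before the join (the Dataset and column names are hypothetical):

    import org.apache.spark.sql.functions.col

    // ds is assumed to be one of the Datasets being joined; "id" is a hypothetical
    // column whose stray header row carries the literal text "id".
    val cleaned = ds.filter(col("id") =!= "id")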

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
On 3 Aug 2016, at 22:01, Mich Talebzadeh > wrote: OK, in other words the result set of joining two datasets ends up inconsistent, as a header from one DS is joined with another row from another DS? I am not 100% sure I got

RE: Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-03 Thread Ulanov, Alexander
Hi Sean, I updated the issue, could you check the changes? Best regards, Alexander -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, August 03, 2016 2:49 AM To: Utkarsh Sengar Cc: User Subject: Re: Spark 2.0

Re: How to get recommendation results for users in a Kafka SparkStreaming Application

2016-08-03 Thread Cody Koeninger
MatrixFactorizationModel is serializable. Instantiate it on the driver, not on the executors. On Wed, Aug 3, 2016 at 2:01 AM, wrote: > hello guys: > I have an app which consumes json messages from kafka and recommend > movies for the users in those messages ,the
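A rough sketch of that pattern (the model path, JSON parsing and kafkaStream are hypothetical): load the model once on the driver, pull each micro-batch's user ids back to the driver, and score them there rather than inside a closure shipped to the executors.

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Loaded once, on the driver.
    val model = MatrixFactorizationModel.load(sc, "hdfs:///models/als")

    // kafkaStream is assumed to be a DStream[String] of JSON messages; extracting
    // the user id is stubbed out here.
    def extractUserId(json: String): Int = json.hashCode.abs % 1000   // placeholder parsing only

    kafkaStream.foreachRDD { rdd =>
      // The set of users per batch is assumed small enough to collect to the driver.
      val userIds = rdd.map(extractUserId).distinct().collect()
      userIds.foreach { uid =>
        val recs = model.recommendProducts(uid, 10)   // top-10 recommendations per user
        recs.foreach(println)
      }
    }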

Spark Streaming not consistently picking files from streaming directory

2016-08-03 Thread ravi.gawai
I am using Spark Streaming for text files. I have a feeder program which moves files into the Spark streaming directory. While Spark is processing a particular file, if the feeder puts another file into the streaming directory at the same time, sometimes Spark does not pick that file up for processing. We are using
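One detail worth checking (a sketch, assuming an HDFS-monitored directory and an existing StreamingContext ssc): textFileStream only picks up files that appear atomically in the directory, so the feeder should write each file elsewhere first and then rename/move it in.

    // Files must be moved (renamed) into the monitored directory, not written in place,
    // or a partially written file can be missed or read incompletely.
    val lines = ssc.textFileStream("hdfs:///streaming/incoming")   // hypothetical path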

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-08-03 Thread Michael Gummelt
DC/OS was designed to reduce the operational cost of maintaining a cluster, and DC/OS Spark runs well on it. On Sat, Jul 16, 2016 at 11:11 AM, Teng Qiu wrote: > Hi Mark, thanks, we just want to keep our system as simple as > possible, using YARN means we need to maintain a

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Are you using Spark 1.6+ ? See SPARK-11293 On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan < dusan.rychnov...@firma.seznam.cz> wrote: > Hi, > > > I have a Spark workflow that when run on a relatively small portion of > data works fine, but when run on big data fails with strange errors. In

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Yes, I believe I'm using Spark 1.6.0. > spark-submit --version reports version 1.6.0. I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows + 17 columns. I first apply a StringIndexer, but it already breaks there, giving an OOM java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G
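With 56 million distinct values the label-to-index table itself is huge; one alternative worth sketching (not from the thread; df and the column name are hypothetical) is the hashing trick, which maps values into a fixed-width sparse vector without ever materializing that table:

    import org.apache.spark.ml.feature.HashingTF
    import org.apache.spark.sql.functions.{array, col}

    // HashingTF expects an array-of-strings column, so wrap the single value.
    val wrapped = df.withColumn("wrapped", array(col("highCardCol")))

    val hasher = new HashingTF()
      .setInputCol("wrapped")
      .setOutputCol("features")
      .setNumFeatures(1 << 22)   // width of the hashed space, a tunable trade-off

    val hashed = hasher.transform(wrapped)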

[SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mehdi Meziane
Hi all, We have a Hive table stored in S3 and registered in a Hive metastore. This table is partitioned with a key "day". So we access this table through the Spark dataframe API as: sqlContext.read() .table("tablename") .where(col("day").between("2016-08-01","2016-08-02")) When

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-08-03 Thread Michael Gummelt
> but Spark on Mesos is certainly lagging behind Spark on YARN regarding the features Spark uses off the scheduler backends -- security, data locality, queues, etc. If by security you mean Kerberos, we'll be upstreaming that to Apache Spark soon. It's been in DC/OS Spark for a while:

Re: how to use spark.mesos.constraints

2016-08-03 Thread Michael Gummelt
If you run your jobs with debug logging on in Mesos, it should print why the offer is being declined: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L301 On Tue, Jul 26, 2016 at 6:38 PM, Rodrick

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Mich Talebzadeh
Lev, Mine worked with this dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided" [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 11.084 s] [INFO] Spark Project Tags

Spark 2.0 empty result in some tpc-h queries

2016-08-03 Thread eviekas
Hello, I recently upgraded from Spark 1.6 to Spark 2.0 with Hadoop 2.7. I have some Scala code which I updated for Spark 2.0 that creates parquet tables from textfiles and runs tpc-h queries with the spark.sql module on a 1TB dataset. Most of the queries run as expected with correct results but

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Yong Zhang
Data locality is part of the job/task scheduling responsibility. So both links you specified originally are correct: one is for the standalone mode that comes with Spark, the other is for YARN. Both have this ability. But YARN, as a very popular scheduling component, comes with MUCH, MUCH more

Re: converting a Dataset into JavaRDD

2016-08-03 Thread Carlo . Allocca
Problem solved. The import of org.apache.spark.api.java.function.Function was missing. Thanks. Carlo On 3 Aug 2016, at 12:14, Carlo.Allocca > wrote: Hi All, I am trying to convert a Dataset into JavaRDD in order to apply a linear