Re: 'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-07-03 Thread sneha29shukla
Hi, Any pointers? I'm not sure if this thread is reaching the right audience? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/numBins-property-not-honoured-in-BinaryClassificationMetrics-class-when-spark-default-parallelism-is1-tp27204p27269.html Sent from

Friendly Reminder: Spark Summit EU CfP Deadline July 1, 2016

2016-06-29 Thread Jules Damji
Hello All, If you haven't submitted a CfP for Spark Summit EU, the deadline is this Friday, July 1st. Submit at https://spark-summit.org/eu-2016/ Cheers! Jules Spark Community Evangelist Databricks, Inc. Sent from my iPhone Pardon the dumb thumb typos :)

'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-06-22 Thread sneha29shukla
trics import org.apache.spark.{SparkConf, SparkContext} object TestCode { def main(args: Array[String]): Unit = { val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local") sparkConf.set("spark.default.parallelism","
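The preview is cut off by the archive; a minimal, self-contained reconstruction of the kind of repro being described (made-up scores and labels, parameter values chosen only for illustration) would look roughly like this:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    object NumBinsRepro {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("NumBinsRepro").setMaster("local[4]")
        // sparkConf.set("spark.default.parallelism", "1")  // the reporter only sees numBins honoured with this set
        val sc = new SparkContext(sparkConf)

        // hypothetical (score, label) pairs standing in for real model output
        val scoreAndLabels = sc.parallelize(Seq.fill(100000)((math.random, math.round(math.random).toDouble)))

        val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
        // with down-sampling working, the ROC curve should have on the order of numBins points
        println(s"ROC points: ${metrics.roc().count()}")
        sc.stop()
      }
    }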

Fwd: 'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-06-21 Thread Sneha Shukla
trics import org.apache.spark.{SparkConf, SparkContext} /** * Created by sneha.shukla on 17/06/16. */ object TestCode { def main(args: Array[String]): Unit = { val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local") sparkConf.set("

Re: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
askresult_8 stored as bytes in >> memory (estimated size 2.0 MB, free 2.4 GB) >> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_8 in memory on >> 192.168.56.1:56413 (size: 2.0 MB, free: 2.4 GB) >> 16/06/15 19:45:22 INFO Executor: Finished task 2.0

Fwd: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
.1:56413 (size: 2.0 MB, free: 2.4 GB) > 16/06/15 19:45:22 INFO Executor: Finished task 2.0 in stage 2.0 (TID 8). > 2143771 bytes result sent via BlockManager) > 16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning &

Re: Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread أنس الليثي
submit the spark job >> >> *nohup sh $SPZRK_HOME/bin/spark-submit --total-executor-cores 5 --class >> org.css.java.gnipStreaming.GnipSparkStreamer --master >> spark://kafka-b05:7077 >> GnipStreamContainer.jar powertrack >> kafka-b01.css.org,kafka-b02.css.org,kafka-b03.css.org,kafka-b

Re: Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread Rodrick Brown
track kafka-b01.css.org,kafka-b02.css.org,kafka-b03.css.org,kafka-b04.css.org,kafka-b05.css.org gnip_live_stream 2 &* After about 1 hour the spark job get killed The logs in the nohub file shows the following exception /org.apache.spark.storage.BlockFetchException: Failed to fetch block from 2 loc

Re: Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread Ted Yu
> > *nohup sh $SPZRK_HOME/bin/spark-submit --total-executor-cores 5 --class > org.css.java.gnipStreaming.GnipSparkStreamer --master > spark://kafka-b05:7077 > GnipStreamContainer.jar powertrack > kafka-b01.css.org,kafka-b02.css.org,kafka-b03.css.org,kafka-b04.css.org, > kafka-b0

Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread fanooos
/spark-submit --total-executor-cores 5 --class org.css.java.gnipStreaming.GnipSparkStreamer --master spark://kafka-b05:7077 GnipStreamContainer.jar powertrack kafka-b01.css.org,kafka-b02.css.org,kafka-b03.css.org,kafka-b04.css.org,kafka-b05.css.org gnip_live_stream 2 &* After about 1 hour the s

java.lang.RuntimeException: Executor is not registered - Spark 1.

2016-04-06 Thread Rodrick Brown
n all my slaves. My command args look like the following: /opt/spark-1.6.1/bin/spark-submit --master "mesos://zk://prod-zookeeper-1:2181 ,prod-zookeeper-2:2181,prod-zookeeper-3:2181/mesos" \--conf spark.ui.port=31232 \ \--conf spark.mesos.coarse=true \ \--conf spark.mesos.constraint

Re: Spark + Kafka all messages being used in 1 batch

2016-03-06 Thread Shahbaz
- What is the Message Size and type.==> >> *String, 9,550 bytes (around) * >>- How big is the spark cluster (How many executors ,How many cores >>Per Executor)==>* 2 Nodes, 16 executors, 1 core per executor* >>- What does your Spark Job looks

Re: Spark + Kafka all messages being used in 1 batch

2016-03-06 Thread Vinti Maheshwari
many executors ,How many cores Per >Executor)==>* 2 Nodes, 16 executors, 1 core per executor* >- What does your Spark Job looks like ==> > > >val messages = KafkaUtils.createDirectStream[String, String, > StringDecoder, StringDecoder]( > ssc, kafkaParams, top

Re: Spark + Kafka all messages being used in 1 batch

2016-03-06 Thread Vinti Maheshwari
ng, 9,550 bytes (around) * - How big is the spark cluster (How many executors ,How many cores Per Executor)==>* 2 Nodes, 16 executors, 1 core per executor* - What does your Spark Job looks like ==> val messages = KafkaUtils.createDirectStream[String, String, StringDecoder

Re: Spark + Kafka all messages being used in 1 batch

2016-03-05 Thread Shahbaz
Hello, - Which version of Spark you are using. - How big is the Kafka Cluster - What is the Message Size and type. - How big is the spark cluster (How many executors ,How many cores Per Executor) - What does your Spark Job looks like . spark.streaming.backpressure.enabled set it

Re: Spark + Kafka all messages being used in 1 batch

2016-03-05 Thread Supreeth
Try setting spark.streaming.kafka.maxRatePerPartition, this can help control the number of messages read from Kafka per partition on the spark streaming consumer. -S > On Mar 5, 2016, at 10:02 PM, Vinti Maheshwari wrote: > > Hello, > > I am trying to figure out why my kafka+spark job is run
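For reference, the setting Supreeth mentions is applied on the SparkConf before the streaming context is created; the values below are placeholders to be tuned per workload:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("KafkaRateLimited")
      // upper bound on messages read per second, per Kafka partition, per batch
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Spark 1.5+: let the backpressure mechanism adapt the ingest rate automatically
      .set("spark.streaming.backpressure.enabled", "true")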

Spark + Kafka all messages being used in 1 batch

2016-03-05 Thread Vinti Maheshwari
Hello, I am trying to figure out why my kafka+spark job is running slow. I found that spark is consuming all the messages out of kafka into a single batch itself and not sending any messages to the other batches. 2016/03/05 21:57:05

Re: Spark master takes more time with local[8] than local[1]

2016-01-25 Thread nsalian
. Salian Cloudera -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-master-takes-more-time-with-local-8-than-local-1-tp26052p26061.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Spark master takes more time with local[8] than local[1]

2016-01-24 Thread Ted Yu
500 GB HDD > 8 CPUs > > Following are the parameters i'm starting my Spark context with: > > val conf = new > > SparkConf().setAppName("MasterApp").setMaster("local[1]").set("spark.executor.memory", > "20g") > > I'm reading a 4

Spark master takes more time with local[8] than local[1]

2016-01-24 Thread jimitkr
Hi All, I have a machine with the following configuration: 32 GB RAM 500 GB HDD 8 CPUs Following are the parameters i'm starting my Spark context with: val conf = new SparkConf().setAppName("MasterApp").setMaster("local[1]").set("spark.executor.memory", &q
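One detail worth keeping in mind when reading this thread: in local mode there is no separate executor JVM, so a spark.executor.memory value set on the SparkConf after launch has little effect; the heap is whatever the driver JVM was started with. A sketch of the setup, assuming submission through spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    // Heap in local mode is controlled by the already-running JVM,
    // e.g. spark-submit --driver-memory 20g, not by spark.executor.memory set here.
    val conf = new SparkConf()
      .setAppName("MasterApp")
      .setMaster("local[8]") // "local[1]" would pin all tasks to a single thread
    val sc = new SparkContext(conf)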

RE: Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-21 Thread Siddharth Ubale
: Siddharth Ubale Subject: Re: Container exited with a non-zero exit code 1-SparkJOb on YARN Hi, For the memory issues, you might need to review current values for maximum allowed container memory on YARN configuration. Check values current defined for "yarn.nodemanager.resource.memory-mb

Re: Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-20 Thread Shixiong(Ryan) Zhu
; a kafka topic , inserting the JSON values to hbase tables via Phoenix , > ands then sending out certain messages to a websocket if the JSON satisfies > a certain criteria. > > > > My cluster is a 3 node cluster with 24GB ram and 24 cores in total. > > > > Now : > >

Container exited with a non-zero exit code 1-SparkJOb on YARN

2016-01-20 Thread Siddharth Ubale
. My cluster is a 3 node cluster with 24GB ram and 24 cores in total. Now : 1. when I am submitting the job with 10GB memory, the application fails saying memory is insufficient to run the job 2. The job is submitted with 6G ram. However, it does not run successfully always.Common issues faced

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Steve Loughran
to run periodically where I need to read > on the order of 1+ million files from an S3 bucket. It is not the entire > bucket (nor does it match a pattern). Instead, I have a list of random keys > that are 'names' for the files in this S3 bucket. The bucket itself will >

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
h > what I'm doing already. Was just thinking there might be a better way. > > Darin. > -- > *From:* Daniel Imberman > *To:* Darin McBeath ; User > *Sent:* Wednesday, January 13, 2016 2:48 PM > *Subject:* Re: Best practice for retrievin

Re: Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Daniel Imberman
iences. > > I currently have a job that I need to run periodically where I need to > read on the order of 1+ million files from an S3 bucket. It is not the > entire bucket (nor does it match a pattern). Instead, I have a list of > random keys that are 'names' for the files in thi

Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Darin McBeath
I'm looking for some suggestions based on other's experiences. I currently have a job that I need to run periodically where I need to read on the order of 1+ million files from an S3 bucket. It is not the entire bucket (nor does it match a pattern). Instead, I have a list of random
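One way to attack this kind of key-list workload (a sketch only, not necessarily what the thread converged on) is to parallelize the key list itself and open each object through the Hadoop FileSystem API on the executors; the s3n scheme, slice count and credentials setup are assumptions:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import scala.io.Source

    def readListedKeys(sc: SparkContext, bucket: String, keys: Seq[String]) =
      sc.parallelize(keys, numSlices = 2000).mapPartitions { part =>
        // one FileSystem handle per partition, created on the executor
        // (assumes S3 credentials are available via core-site.xml or the environment)
        val fs = FileSystem.get(new URI(s"s3n://$bucket/"), new Configuration())
        part.map { key =>
          val in = fs.open(new Path(s"s3n://$bucket/$key"))
          try Source.fromInputStream(in).mkString finally in.close()
        }
      }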

Re: How to optimize and make this code faster using coalesce(1) and mapPartitionIndex

2016-01-13 Thread unk1102
hive orc table partitions and with Spark it is easy to load orc files in DataFrame. Initially I have 24 orc files/split and hence 24 partitions but when I do sourceFrame.toJavaRDD().coalesce(1,true) this is where it stucks hangs for hours and do nothing I am sure even it is not hitting 2GB limit as

How to optimize and make this code faster using coalesce(1) and mapPartitionIndex

2016-01-12 Thread unk1102
Hi I have the following code which I run as part of thread which becomes child job of my main Spark job it takes hours to run for large data around 1-2GB because of coalesce(1) and if data is in MB/KB then it finishes faster with more data sets size sometimes it does not complete at all. Please
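The code itself is not shown in the preview; a sketch of the pattern being described, with the shuffle flag that usually matters for coalesce(1), is below (sourceRdd stands in for sourceFrame.toJavaRDD() in the post):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    def singlePartitionPass(sourceRdd: RDD[Row]): RDD[String] =
      sourceRdd
        .coalesce(1, shuffle = true)            // shuffle = true lets the 24 upstream partitions keep working in parallel
        .mapPartitionsWithIndex { (idx, rows) =>
          // everything now arrives in partition 0; the sequential, order-sensitive work goes here
          rows.map(row => s"$idx\t${row.mkString(",")}")
        }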

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Andy Davidson
coalesce(1).saveAsTextfile() takes forever? > hi I am trying to save many partitions of Dataframe into one CSV file and it > take forever for large data sets of around 5-6 GB. > > sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzi > p&qu

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Umesh Kacha
Hi dataframe has not boolean option for coalesce it is only for RDD I believe sourceFrame.coalesce(1,true) //gives compilation error On Wed, Jan 6, 2016 at 1:38 AM, Alexander Pivovarov wrote: > try coalesce(1, true). > > On Tue, Jan 5, 2016 at 11:58 AM, unk1102 wrote: > >>

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
another option will be to try rdd.toLocalIterator() not sure if it will help though I had same problem and ended up to move all parts to local disk(with Hadoop FileSystem api) and then processing them locally On 5 January 2016 at 22:08, Alexander Pivovarov wrote: > try coalesce(1, t
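A rough sketch of the toLocalIterator approach Igor describes (no header, no compression, local path is a placeholder):

    import java.io.PrintWriter
    import org.apache.spark.sql.DataFrame

    def writeSingleCsvLocally(df: DataFrame, localPath: String): Unit = {
      val out = new PrintWriter(localPath)
      try {
        // pulls one partition at a time to the driver instead of coalescing on the cluster
        df.rdd.toLocalIterator.foreach(row => out.println(row.mkString(",")))
      } finally out.close()
    }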

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
try coalesce(1, true). On Tue, Jan 5, 2016 at 11:58 AM, unk1102 wrote: > hi I am trying to save many partitions of Dataframe into one CSV file and > it > take forever for large data sets of around 5-6 GB. > > > sourceFrame.coalesce(1).write().format("com.databricks.s

coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread unk1102
hi I am trying to save many partitions of Dataframe into one CSV file and it take forever for large data sets of around 5-6 GB. sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop") For small data above code works we

Re: Download Problem with Spark 1.5.2 pre-built for Hadoop 1.X

2015-12-17 Thread Jean-Baptiste Onofré
Hi, we have a Jira about that (let me find it): by default, a suffix is appended causing issue to resolve the artifact. Let me find the Jira and the workaround. Regards JB On 12/17/2015 12:48 PM, abc123 wrote: Get error message when I try to download Spark 1.5.2 pre-built for Hadoop 1.X

Download Problem with Spark 1.5.2 pre-built for Hadoop 1.X

2015-12-17 Thread abc123
Get error message when I try to download Spark 1.5.2 pre-built for Hadoop 1.X. Can someone help me please? Error: http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop1.tgz NoSuchKey The specified key does not exist. spark-1.5.2-bin-hadoop1.tgz

worker:java.lang.ClassNotFoundException: ttt.test$$anonfun$1

2015-12-14 Thread Bonsen
27;s ip 15/12/14 03:18:34 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 219.216.64.55): java.lang.ClassNotFoundException: ttt.test$$anonfun$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessControll

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Thank you for your reply! I have already done the change locally. So for changing it would be fine. I just wanted to be sure which way is correct. On 9 Dec 2015 18:20, "Fengdong Yu" wrote: > I don’t think there is performance difference between 1.x API and 2.x API. > > but i

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is performance difference between 1.x API and 2.x API. but it’s not a big issue for your change, only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java <https://github.com/databricks/spark-xml/blob/master/src/main/java/com/databricks/hadoop/mapreduce/lib/in

Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all, I am writing this email to both user-group and dev-group since this is applicable to both. I am now working on Spark XML datasource ( https://github.com/databricks/spark-xml). This uses a InputFormat implementation which I downgraded to Hadoop 1.x for version compatibility. However, I
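For readers not familiar with the two API generations being compared: the same logical input format is consumed through sc.hadoopFile (org.apache.hadoop.mapred) or sc.newAPIHadoopFile (org.apache.hadoop.mapreduce). TextInputFormat is used below only to keep the sketch compilable; spark-xml would plug in its own XmlInputFormat:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext

    def loadWithOldApi(sc: SparkContext, path: String) =
      sc.hadoopFile[LongWritable, Text, org.apache.hadoop.mapred.TextInputFormat](path)

    def loadWithNewApi(sc: SparkContext, path: String) =
      sc.newAPIHadoopFile[LongWritable, Text, org.apache.hadoop.mapreduce.lib.input.TextInputFormat](path)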

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-12-06 Thread Mohamed Nadjib Mami
Your jars are not delivered to the workers. Have a look at this: http://stackoverflow.com/questions/24052899/how-to-make-it-easier-to-deploy-my-jar-to-spark-cluster-in-standalone-mode -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-Cl

Re: spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread Sahil Sareen
"select 'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case when uid is null then 0 else 1 end),sum(case when uid is null then 1 else 0 end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb" Is this as is, or did you use a UDF here? -Sahil On Thu,

RE: spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread Mich Talebzadeh
onments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is fo

spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread hxw黄祥为
max(length(uid)),count(distinct(uid)),count(uid),sum(case when uid is null then 0 else 1 end),sum(case when uid is null then 1 else 0 end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb") b.show //the result just one line but this step is extremely slow Is this expected? Why sh

RE: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-24 Thread Shuai Zheng
=45G If I set the --executor-cores=20 (or anything less than 20, there is no issue). This is a quite interesting case, because the instance (C3*8xlarge) has 32 virtual core and can run without any issue with one task . So I guess the issue should come from: 1, connection limit from EC2

Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Jonathan Kelly
would help you figure out what it is doing. You should also look at the YARN container logs (which automatically get uploaded to your S3 logs bucket if you have this enabled). ~ Jonathan On Thu, Nov 19, 2015 at 1:32 PM, Shuai Zheng wrote: > Hi All, > > > > I face a very weird case

Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Shuai Zheng
master and one task node. But when I try to run a multiple node (more than 1 task node, which means 3 nodes cluster), the tasks will never return from one of it. The log as below: 15/11/19 21:19:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-10-165-121-188.ec2.internal

Want 1-1 map between input files and output files in map-only job

2015-11-19 Thread Arun Luthra
Hello, Is there some technique for guaranteeing that there is a 1-1 correspondence between the input files and the output files? For example if my input directory has files called input001.txt, input002.txt, ... etc. I would like Spark to generate output files named something like part-1
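One idea for this (a sketch under the assumption that each input file fits comfortably in memory, not necessarily what the list recommended): wholeTextFiles keeps the source path with each record, so the output name can be derived from the input name:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    def mapFilesOneToOne(sc: SparkContext, inDir: String, outDir: String): Unit =
      sc.wholeTextFiles(inDir).foreachPartition { files =>
        val fs = FileSystem.get(new URI(outDir), new Configuration())
        files.foreach { case (path, contents) =>
          val name = new Path(path).getName                       // e.g. input001.txt
          val out = fs.create(new Path(s"$outDir/out-$name"))     // one output per input
          try out.write(contents.toUpperCase.getBytes("UTF-8"))   // stand-in transformation
          finally out.close()
        }
      }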

Re: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
I will Thank you. > On 27 Oct 2015, at 4:54, Felix Cheung wrote: > > Please open a JIRA? > > > Date: Mon, 26 Oct 2015 15:32:42 +0200 > Subject: HiveContext ignores ("skip.header.line.count"="1") > From: daniel.ha...@veracity-group.com > To

RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Cheng, Hao
ignores ("skip.header.line.count"="1") Please open a JIRA? Date: Mon, 26 Oct 2015 15:32:42 +0200 Subject: HiveContext ignores ("skip.header.line.count"="1") From: daniel.ha...@veracity-group.com<mailto:daniel.ha...@ve

RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Felix Cheung
Please open a JIRA? Date: Mon, 26 Oct 2015 15:32:42 +0200 Subject: HiveContext ignores ("skip.header.line.count"="1") From: daniel.ha...@veracity-group.com To: user@spark.apache.org Hi,I have a csv table in Hive which is configured to skip the header row

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Koert Kuipers
it seems HadoopFsRelation keeps track of all part files (instead of just the data directories). i believe this has something to do with parquet footers but i didnt bother to look more into it. but yet the result is that driver side it: 1) tries to keep track of all part files in a Map[Path

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
uilder.java:130) >> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114) >> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415) >> java.lang.StringBuilder.append(StringBuilder.java:132) >> org.apache.hadoop.fs.Path.toString(Pa

HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
Hi, I have a csv table in Hive which is configured to skip the header row using TBLPROPERTIES("skip.header.line.count"="1"). When querying from Hive the header row is not included in the data, but when running the same query via HiveContext I get the header row. I made sure t
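Until the reported behaviour is fixed, a common stop-gap (sketch only; "id" stands in for whatever the first column and its header text are called) is to filter the header row out on the Spark side after reading through HiveContext:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)                  // sc assumed in scope, as in spark-shell
    val df = hiveCtx.sql("SELECT * FROM csv_table")
    val withoutHeader = df.filter(df("id") !== "id")   // drop rows that still carry the header text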

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Fengdong Yu
) > org.apache.hadoop.fs.Path.toString(Path.java:384) > org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447) > org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
(AbstractStringBuilder.java:415) java.lang.StringBuilder.append(StringBuilder.java:132) org.apache.hadoop.fs.Path.toString(Path.java:384) org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447) org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
t is marked as >> resolved. However it is not. I have over a million output directories for 1 >> single column in partitionBy. Not sure if this is a regression issue? Do I >> need to set some parameters to make it more memory efficient? >> >> Best Regards, >>

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
hink I hit the same issue SPARK-8890 > https://issues.apache.org/jira/browse/SPARK-8890. It is marked as > resolved. However it is not. I have over a million output directories for 1 > single column in partitionBy. Not sure if this is a regression issue? Do I > need to set some paramet

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi spark guys, I think I hit the same issue SPARK-8890 https://issues.apache.org/jira/browse/SPARK-8890. It is marked as resolved. However it is not. I have over a million output directories for 1 single column in partitionBy. Not sure if this is a regression issue? Do I need to set some

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys, After waiting for a day, it actually causes OOM on the spark driver. I configure the driver to have 6GB. Note that I didn't call refresh myself. The method was called when saving the dataframe in parquet format. Also I'm using partitionBy() on the DataFrameWriter to gener

[Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-24 Thread Jerry Lam
Hi Spark users and developers, Does anyone encounter any issue when a spark SQL job produces a lot of files (over 1 millions), the job hangs on the refresh method? I'm using spark 1.5.1. Below is the stack trace. I saw the parquet files are produced but the driver is doing something

Re: error in sparkSQL 1.5 using count(1) in nested queries

2015-10-09 Thread Michael Armbrust
rom 1.4.1 to 1.5.1 I found some of my spark SQL queries > no longer worked. Seems to be related to using count(1) or count(*) in a > nested query. I can reproduce the issue in a pyspark shell with the sample > code below. The ‘people’ table is from spark-1.5.1-bin-hadoop2.4/ > exampl

error in sparkSQL 1.5 using count(1) in nested queries

2015-10-08 Thread Jeff Thompson
After upgrading from 1.4.1 to 1.5.1 I found some of my spark SQL queries no longer worked. Seems to be related to using count(1) or count(*) in a nested query. I can reproduce the issue in a pyspark shell with the sample code below. The ‘people’ table is from spark-1.5.1-bin-hadoop2.4/ examples

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Marcelo Vanzin
; > > > From: Andy Dang > Sent: Wednesday, September 30, 2015 8:17 PM > To: Nicolae Marasoiu > Cc: user@spark.apache.org > Subject: Re: sc.parallelize with defaultParallelism=1 > > Can't you just load the data from HBase first, and then

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
m/r part. From: Andy Dang Sent: Wednesday, September 30, 2015 8:17 PM To: Nicolae Marasoiu Cc: user@spark.apache.org Subject: Re: sc.parallelize with defaultParallelism=1 Can't you just load the data from HBase first, and then call sc.parallelize on your dataset? -Andy ---

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Andy Dang
Can't you just load the data from HBase first, and then call sc.parallelize on your dataset? -Andy --- Regards, Andy (Nam) Dang On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > When calling sc.parallelize(data,1)

sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
Hi, When calling sc.parallelize(data,1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program. I would prefer to keep the data local to the driver. The use case is when I need just to load a bit of data from
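For context, a tiny example of the call being discussed; the Seq stays in driver memory inside the resulting ParallelCollectionRDD and is only shipped out when the single task actually runs:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TinySeed").setMaster("local[*]"))

    val seed = Seq("row-1", "row-2", "row-3")           // hypothetical small driver-side data
    val seedRdd = sc.parallelize(seed, numSlices = 1)   // one partition, hence one task
    println(seedRdd.count())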

RE: Why is 1 executor overworked and other sit idle?

2015-09-23 Thread Richard Eggert
to disk is expensive as well, so all your expensive operations are the ones that involve a single partition. On Sep 22, 2015 11:45 PM, "Chirag Dewan" wrote: > Thanks Ted and Rich. > > > > So if I repartition my RDD programmatically and call coalesce on the RDD > to 1 p

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Jonathan Coveney
in: *18.27 seconds*/ > - for a 38Mb file: > /lines found: 821 > in: 2.53 seconds/ > > I must do something wrong to obtain a result twice as slow on the 8 threads > than on 1 thread. > > 1. First, I thought it might be because of the setting-up cost of

RE: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Chirag Dewan
Thanks Ted and Rich. So if I repartition my RDD programmatically and call coalesce on the RDD to 1 partition would that generate 1 output file? Ahh.. Is my coalesce operation causing 1 partition, hence 1 output file and 1 executor working on all the data? To summarize this is what I do :- 1

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Richard Eggert
.append(row.getString(7)); > > > > return > sb.toString(); > > } > > > > My map methods looks like this. > > > > I am having a 3 node cluster. I observe that driver starts on Node A. An

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Richard Eggert
in: 2.53 seconds/ > > I must do something wrong to obtain a result twice as slow on the 8 threads > than on 1 thread. > > 1. First, I thought it might be because of the setting-up cost of Spark. > But > for smaller files it only takes 2 seconds which makes this option > impro

Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread juljoin
82100 in: *18.27 seconds*/ - for a 38Mb file: /lines found: 821 in: 2.53 seconds/ I must do something wrong to obtain a result twice as slow on the 8 threads than on 1 thread. 1. First, I thought it might be because o

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Ted Yu
gt; > My map methods looks like this. > > I am having a 3 node cluster. I observe that driver starts on Node A. And > executors are spawned on all 3 nodes. But the executor of Node B or C are > doing all the tasks. It starts a saveasTextFile job with 1 output partition > and

Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Chirag Dewan
3 node cluster. I observe that driver starts on Node A. And executors are spawned on all 3 nodes. But the executor of Node B or C are doing all the tasks. It starts a saveasTextFile job with 1 output partition and stores the RDDs in memory and also commits the file on local file system. This exe

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is df.rdd does not work very well with limit. In scala, df.limit(1).rdd will also trigger the issue you observed. I will add this in the jira. On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote: > I just noticed you found 1.4 has the same issue. I added that as well
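For readers following along, the two forms being compared in this thread look like this in Scala (the path is the placeholder from the original post; sqlContext as provided by spark-shell):

    val df = sqlContext.read.parquet("someparquetfiles")

    df.take(1)                   // plain take(1) / head() / first() on the DataFrame
    df.limit(1).rdd.collect()    // per Yin's note, limit(1) followed by .rdd hits the slow path being reported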

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens als

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
actually a bit. I created a ticket for this (SPARK-10731 <https://issues.apache.org/jira/browse/SPARK-10731>). Best Regards, Jerry On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > &

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
t; >> Thanks, >> >> Yin >> >> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: >> >>> Hi Spark Developers, >>> >>> I just ran some very simple operations on a dataset. I was surprise by >>> the execution plan of take(1), head

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
velopers, >> >> I just ran some very simple operations on a dataset. I was surprise by >> the execution plan of take(1), head() or first(). >> >> For your reference, this is what I did in pyspark 1.5: >> df=sqlContext.read.parquet("someparquetfiles") &

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > Hi Spark Developers, > > I just ran some very simple operations on a dataset. I was surprise by the > execution plan of take(1), hea

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprise by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-09 Thread Umesh Kacha
DataFrame which again I use it to register as temp table using hiveContext and execute few insert into partitions query using hiveContext.sql Please help me optimize above code. On Sep 9, 2015 2:55 AM, "Richard Marscher" wrote: > Hi, > > what is the reasoning behind the use of

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-08 Thread Richard Marscher
Hi, what is the reasoning behind the use of `coalesce(1,false)`? This is saying to aggregate all data into a single partition, which must fit in memory on one node in the Spark cluster. If the cluster has more than one node it must shuffle to move the data. It doesn't seem like the followin
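The RDD-level distinction Richard is drawing, in code (someRdd is an arbitrary placeholder):

    import org.apache.spark.rdd.RDD

    def illustrate(someRdd: RDD[String]): Unit = {
      val noShuffle = someRdd.coalesce(1, shuffle = false)  // no shuffle, but upstream stages also collapse to one task
      val shuffled  = someRdd.coalesce(1, shuffle = true)   // upstream stays parallel; data is then shuffled to one partition
      val same      = someRdd.repartition(1)                // shorthand for coalesce(1, shuffle = true)
    }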

Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-04 Thread unk1102
as expected only causing slowness because of GC pause and shuffles lots of data so hitting memory issues. Please guide I am new to Spark. Thanks in advance. JavaRDD updatedDsqlRDD = orderedFrame.toJavaRDD().coalesce(1, false).map(new Function() { @Override public Row call(Row row) throws

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Umesh Kacha
ark >>> I end up with 4 * 1000 * 200 = 8 small small files in HDFS which >>> wont be >>> good for HDFS name node I have been told if you keep on creating such >>> large >>> no of small small files namenode will crash is it true? please help me >&g

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Hemant Bhanawat
for HDFS name node I have been told if you keep on creating such >> large >> no of small small files namenode will crash is it true? please help me >> understand. Anyways so to avoid creating small files I did set >> spark.sql.shuffle.partitions=1 it seems to be creating 1 ou

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Hemant Bhanawat
all small files namenode will crash is it true? please help me > understand. Anyways so to avoid creating small files I did set > spark.sql.shuffle.partitions=1 it seems to be creating 1 output file and as > per my understanding because of only one output there is so much shuffling > to

spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-19 Thread unk1102
wont be good for HDFS name node I have been told if you keep on creating such large no of small small files namenode will crash is it true? please help me understand. Anyways so to avoid creating small files I did set spark.sql.shuffle.partitions=1 it seems to be creating 1 output file and as per my
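The setting under discussion is a plain SQLContext/HiveContext property; a sketch of the two extremes being traded off (sc assumed in scope, values illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    // the default of 200 shuffle partitions multiplies into a very large number of small files here;
    // "1" gives a single output file per insert but funnels the whole shuffle through one task.
    hiveCtx.setConf("spark.sql.shuffle.partitions", "1")
    // a modest intermediate value is often the practical compromise, e.g.
    // hiveCtx.setConf("spark.sql.shuffle.partitions", "20")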

Re: mllib kmeans produce 1 large and many extremely small clusters

2015-08-10 Thread sooraj
> I tried running mllib k-means with 20newsgroups data set from sklearn. On a > 5000 document data set I get one cluster with most of the documents and > other clusters just have handful of documents. > > #code > newsgroups_train = > fetch_20newsgroups(subset='tr

mllib kmeans produce 1 large and many extremely small clusters

2015-08-09 Thread farhan
I tried running mllib k-means with 20newsgroups data set from sklearn. On a 5000 document data set I get one cluster with most of the documents and other clusters just have handful of documents. #code newsgroups_train = fetch_20newsgroups(subset='train',random_state=1,remove=('hea
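A Scala analogue of that pipeline using MLlib's own featurisation (parameter values are made up); with heavily skewed cluster sizes it is usually worth inspecting the TF-IDF vectors and the cluster-size histogram before blaming k-means itself:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.rdd.RDD

    def clusterDocs(tokenizedDocs: RDD[Seq[String]]): Unit = {
      val tf = new HashingTF(1 << 18).transform(tokenizedDocs)   // term frequencies
      tf.cache()
      val tfidf = new IDF().fit(tf).transform(tf).cache()        // inverse-document-frequency weighting
      val model = KMeans.train(tfidf, k = 20, maxIterations = 20)
      val sizes = model.predict(tfidf).countByValue()
      println(sizes)  // one dominant cluster shows up immediately in this histogram
    }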

Re: Spark same execution time on 1 node and 5 nodes

2015-07-18 Thread Gylfi
more parts before line 52 by calling "rddname".repartition(10) for example and see if it runs faster.. Regards, Gylfi. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-same-execution-time-on-1-node-and-5-nodes-tp23866p23893.html Sen

dataFrame.colaesce(1) or dataFrame.reapartition(1) does not seem work for me

2015-07-10 Thread kachau
ate='2015-05-22') select from sourcetbl bla bla") //above query creates orc file at /user/db/a1/20-05-22 // I want only one part-0 file at the end of above query so I tried the following and none worked drame.coalesce(1).write().format("orc").mode(SaveMode.OverWrite).save
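For the single-output-file part of the question, the DataFrame writer path looks roughly like the following (note the enum is spelled SaveMode.Overwrite; "frame" stands for the DataFrame built from the query above and the path is the poster's placeholder). coalesce(1) means the final write runs as one task, so it only makes sense for modest output sizes:

    import org.apache.spark.sql.SaveMode

    frame.coalesce(1)                       // one partition, therefore one part- file
      .write
      .format("orc")
      .mode(SaveMode.Overwrite)
      .save("/user/db/a1/20-05-22")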

Re: DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-09 Thread ashishdutt
Not really a clean solution but I solved the problem by reinstalling Anaconda -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DLL-load-failed-1-is-not-a-valid-win32-application-on-invoking-pyspark-tp23733p23743.html Sent from the Apache Spark User List

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread ashishdutt
Hi, I get the error, "DLL load failed: %1 is not a valid win32 application" whenever I invoke pyspark. Attached is the screenshot of the same. Is there any way I can get rid of it. Still being new to PySpark and have had, a not so pleasant experience so far most probably because

DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread Ashish Dutt
Hi, I get the error, "DLL load failed: %1 is not a valid win32 application" whenever I invoke pyspark. Attached is the screenshot of the same. Is there any way I can get rid of it. Still being new to PySpark and have had, a not so pleasant experience so far most probably because

binaryFiles() for 1 million files, too much memory required

2015-07-02 Thread Kostas Kougios
Once again I am trying to read a directory tree using binary files. My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e. ROOTDIR/1 ROOTDIR/2 ROOTDIR/.. ROOTDIR/100 A total of 1 mil files split into 100 sub dirs Using binaryFiles requires too much memory on the

binaryFiles() for 1 million files, too much memory required

2015-07-01 Thread Konstantinos Kougios
Once again I am trying to read a directory tree using binary files. My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e. ROOTDIR/1 ROOTDIR/2 ROOTDIR/.. ROOTDIR/100 A total of 1 mil files split into 100 sub dirs Using binaryFiles requires too much memory on
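For what it's worth, binaryFiles accepts a minPartitions hint, which keeps each task's file list small; it is not a guaranteed fix for the driver-side cost of listing a million files, but it is cheap to try (the numbers below are guesses):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("BinaryTree"))

    val files = sc.binaryFiles("hdfs:///ROOTDIR", minPartitions = 10000)
    val totalBytes = files.map { case (_, stream) => stream.toArray().length.toLong }.reduce(_ + _)
    println(totalBytes)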

Re: Unable to use more than 1 executor for spark streaming application with YARN

2015-06-17 Thread Saiph Kappa
How can I get more information regarding this exception? On Wed, Jun 17, 2015 at 1:17 AM, Saiph Kappa wrote: > Hi, > > I am running a simple spark streaming application on hadoop 2.7.0/YARN > (master: yarn-client) with 2 executors in different machines. However, > while the ap
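For completeness, the knobs that control executor count in yarn-client mode on Spark 1.x are --num-executors and --executor-cores on spark-submit, or equivalently on the SparkConf (values below are illustrative; whether YARN actually grants the second container also depends on the queue's memory and vcore limits):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("StreamingOnYarn")
      .setMaster("yarn-client")
      .set("spark.executor.instances", "2")   // same as --num-executors 2
      .set("spark.executor.cores", "2")       // same as --executor-cores 2
    val sc = new SparkContext(conf)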
