How should I interpret Spark's toDebugString()?

2016-06-01 Thread dimoes
This may be a dumb question, but can someone explain how I should interpret this DAG? Please ignore the RDD types. Specifically, what do two parallel lines (e.g. B) mean? What does a single parallel line with no preceding single parallel line (e.g. A) signify? (N) MappedRDD[3] |

Re: StackOverflow in Spark

2016-06-01 Thread Rishabh Wadhawan
A StackOverflowError is generated when the DAG gets too long, as happens when many transformations pile up over many iterations. Please use checkpointing to persist the RDD and break the lineage to get away from this stack overflow error. Look into the checkpoint function. Thanks, hope it helps. Let me know if you need any more
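For reference, a minimal sketch of what lineage-breaking checkpointing can look like in an iterative job (the checkpoint directory, interval, and loop are illustrative assumptions, not code from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // must be reliable storage on a cluster

    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)
      if (i % 10 == 0) {
        rdd.checkpoint() // lineage is truncated at this RDD
        rdd.count()      // an action is needed to actually materialize the checkpoint
      }
    }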

Re: StackOverflow in Spark

2016-06-01 Thread Yash Sharma
Not sure if it's related, but I got a similar stack overflow error some time back while reading files and converting them to Parquet. > Stack trace- > 16/06/02 02:23:54 INFO YarnAllocator: Driver requested a total number of > 32769 executor(s). > 16/06/02 02:23:54 INFO ExecutorAllocationManager:

Re: StackOverflow in Spark

2016-06-01 Thread Matthew Young
Hi, it's related to a bug fixed in Spark, JIRA ticket SPARK-6847. Matthew Yang On Wed, May 25, 2016 at 7:48 PM, Michel Hubert wrote: > Hi, > I have a Spark application which generates StackOverflowError

Re: Spark input size when filtering on parquet files

2016-06-01 Thread Takeshi Yamamuro
Technically, yes. I'm not sure there is a Parquet API for easily fetching file statistics (min, max, ...), though; if it exists, it seems we could skip some file splits in `ParquetFileFormat`.
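As a related sketch, the filter pushdown that relies on those Parquet min/max statistics can be toggled from Spark SQL; the path and predicate below are illustrative (1.6-style API):

    // sqlContext is the shell-provided SQLContext
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    val df = sqlContext.read.parquet("hdfs:///data/events")
    // row groups whose min/max statistics exclude the predicate can be skipped
    df.filter(df("ts") >= "2016-05-01").count()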

--driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-01 Thread Jacek Laskowski
Hi, I'm reviewing the code of spark-submit and can see that although --driver-cores is said to be for Standalone and YARN only, it is also applicable to Mesos [1]. ➜ spark git:(master) ✗ ./bin/spark-shell --help Usage: ./bin/spark-shell [options] ... Spark standalone with cluster deploy mode only:
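For context, --driver-cores maps onto the spark.driver.cores property, so a roughly equivalent programmatic form (illustrative, and only effective in cluster deploy mode) would be:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("driver-cores-sketch")
      .set("spark.driver.cores", "2")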

Re: Spark input size when filtering on parquet files

2016-06-01 Thread Dennis Hunziker
Thanks, that makes sense. What I wonder, though, is whether, if we use Parquet metadata caching, Spark should be able to execute queries much faster over a large number of smaller .parquet files than over a smaller number of large ones. At least as long as the min/max indexing is efficient

Re: Spark streaming reading avro from kafka

2016-06-01 Thread Mohammad Tariq
Hi Neeraj, You might find Kafka-Direct useful. BTW, are you using something like Confluent for your Kafka setup? If that's the case, you might leverage the Schema Registry to get hold of the associated schema without additional

Re: Spark streaming reading avro from kafka

2016-06-01 Thread Igor Berman
An Avro file contains metadata with the schema (the writer schema); in Kafka there is no such thing, so you should produce messages that contain some reference to a known schema (embedding the whole schema would add big overhead). Some people use a schema registry solution. On 1 June 2016 at 21:02, justneeraj
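A hedged sketch of that approach on the consumer side, assuming the writer schema is already known (for example, fetched from a schema registry, which is omitted here):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory

    // schemaJson is assumed to hold the writer schema as a JSON string
    val schema = new Schema.Parser().parse(schemaJson)
    val reader = new GenericDatumReader[GenericRecord](schema)

    def decode(bytes: Array[Byte]): GenericRecord = {
      val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
      reader.read(null, decoder)
    }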

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Option should play nicely with encoders, but it's always possible there are bugs. I think those function signatures are slightly more expensive (one extra object allocation) and it's not as Java-friendly, so we probably don't want them to be the default. That said, I would like to enable that kind

Get and append file name to each record being read

2016-06-01 Thread Vikash Kumar
How can I get the file name for each record being read? Suppose input file ABC_input_0528.txt contains 111,abc,234 222,xyz,456 and input file ABC_input_0531.txt contains 100,abc,299 200,xyz,499, and I need to create one final output with the file name in each record using DataFrames. My output
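One possible DataFrame-based sketch, assuming the input_file_name() function available from Spark 1.6 (paths and column names are illustrative):

    import org.apache.spark.sql.functions.input_file_name

    val df = sqlContext.read.text("hdfs:///input/ABC_input_*.txt")
      .withColumn("file_name", input_file_name())
    // each row now carries the full path of the file it was read from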

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
Ah thanks, I missed seeing the PR for https://issues.apache.org/jira/browse/SPARK-15441. If the rows become null objects, then I can implement methods that map those back to results that align more closely with the RDD interface. As a follow-on, I'm curious about thoughts regarding enriching the

Re: ImportError: No module named numpy

2016-06-01 Thread Bhupendra Mishra
I have numpy installed, but where should I set up PYTHONPATH? On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández wrote: > sudo pip install numpy > > On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra < > bhupendra.mis...@gmail.com> wrote: > >> Thanks . >> How can this be

Re: ImportError: No module named numpy

2016-06-01 Thread Sergio Fernández
sudo pip install numpy On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra wrote: > Thanks . > How can this be resolved? > > On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau wrote: > >> Generally this means numpy isn't installed on the system or your

Re: Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
It seems that to join a DStream with an RDD I can use: mgs.transform(rdd => rdd.join(rdd1)) or mgs.foreachRDD(rdd => rdd.join(rdd1)). But I can't see why rdd1.toDF("id","aid") really causes SPARK-5063. > On Jun 1, 2016, at 12:00, Cyril Scetbon wrote: > > Hi guys, >
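For reference, a minimal sketch of the transform-based variant (names are illustrative; both sides must be key/value pair RDDs):

    import org.apache.spark.rdd.RDD

    val rdd1: RDD[(String, String)] = sc.parallelize(Seq("id1" -> "aid1"))

    // mgs is assumed to be a DStream of (key, value) pairs
    val joined = mgs.transform { batchRdd => batchRdd.join(rdd1) }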

Re: Spark streaming reading avro from kafka

2016-06-01 Thread justneeraj
+1 I am trying to read Avro from Kafka and I don't want to be limited to a small set of schemas. So I want to dynamically load the schema from the Avro file (as Avro contains the schema as well), and then from this I want to create a dataframe and run some queries on it. Any help would be really appreciated.

Re: Spark 1.6 Driver Memory Issue

2016-06-01 Thread kali.tumm...@gmail.com
Hi, I am using the spark-sql shell; while launching, I run it as spark-sql --conf spark.driver.maxResultSize=20g. I also tried spark-sql --conf "spark.driver.maxResults"="20g" but still no luck. Do I need to use the set command, something like spark-sql --conf set "spark.driver.maxReults"="20g"

How to enable core dump in spark

2016-06-01 Thread prateek arora
Hi, I am using Cloudera to set up Spark 1.6.0 on Ubuntu 14.04. I set the core dump limit to unlimited on all nodes by editing the /etc/security/limits.conf file and adding the line " * soft core unlimited ". I rechecked using: $ ulimit -all core file size (blocks, -c) unlimited data seg size

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425 On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher wrote: > Hi, > > I've been working on transitioning from RDD to Datasets in

Re: Map tuple to case class in Dataset

2016-06-01 Thread Michael Armbrust
That error looks like it's caused by the class being defined in the REPL itself. $line29.$read$ is the name of the outer object that is being used to compile the line containing case class Test(a: Int). Is this EMR or the Apache 1.6.1 release? On Wed, Jun 1, 2016 at 8:05 AM, Tim Gautier
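A commonly suggested workaround sketch for this REPL wrapper issue is to compile the case class definition and its use together in a single :paste block (whether this sidesteps the problem can depend on the build):

    scala> :paste
    // Entering paste mode (ctrl-D to finish)

    case class Test(a: Int)
    val ds = Seq(1, 2).toDS.map(t => Test(t))
    ds.show()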

Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
Hi, I've been working on transitioning from RDD to Datasets in our codebase in anticipation of being able to leverage features of 2.0. I'm having a lot of difficulties with the impedance mismatches between how outer joins worked with RDD versus Dataset. The Dataset joins feel like a big step
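To make the mismatch concrete, a hedged 1.6-style sketch of a Dataset left outer join, whose missing matches surface as null tuples, mapped back to the Option shape the RDD API gives (the Option-encoder behavior is exactly what this thread questions):

    val left  = Seq((1, "a"), (2, "b")).toDS.as("l")
    val right = Seq((1, "x")).toDS.as("r")

    // "left_outer" joinWith yields a null right-side tuple where there is no match
    val joined = left.joinWith(right, $"l._1" === $"r._1", "left_outer")

    // map nulls back to Option to mimic RDD.leftOuterJoin results
    val withOption = joined.map { case (l, r) => (l, Option(r)) }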

Re: Spark 1.6 Driver Memory Issue

2016-06-01 Thread ashesh_28
Hi Karthik, you must set the value before the SparkContext (sc) is created. Also, don't assign too much overhead like 20g for maxResultSize; you can set it to 2G maximum as per your error message. Also, if you are using Java 1.8, please add the below section in your yarn-site.xml
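In code form, a sketch of that first point (the property must be on the SparkConf before the context exists; the 2g value mirrors the suggestion above and is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("max-result-size-sketch")
      .set("spark.driver.maxResultSize", "2g") // set before the SparkContext is created
    val sc = new SparkContext(conf)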

Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
Hi guys, I have 2 input data streams that I want to join using DataFrames, and unfortunately I get the message produced by https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1 in (2): (1) val rdd1 = sc.esRDD(es_resource.toLowerCase, query) .map(r

Re: ImportError: No module named numpy

2016-06-01 Thread Bhupendra Mishra
Thanks. How can this be resolved? On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau wrote: > Generally this means numpy isn't installed on the system or your > PYTHONPATH has somehow gotten pointed somewhere odd, > > On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra < >

Re: Spark 1.6 Driver Memory Issue

2016-06-01 Thread Kishoore MV
Hi Kali, in the shuffle stage the maximum memory is 2GB (1024 MB); in your error it is expecting more memory. Can you let me know your cluster config details? Thanks & Regards Kishore M > On 01-Jun-2016, at 9:11 PM, "kali.tumm...@gmail.com" > wrote: > > Hi All , > > I am

Re: Map tuple to case class in Dataset

2016-06-01 Thread Tim Gautier
I was getting a warning about /tmp/hive not being writable whenever I started spark-shell, but I was ignoring it. I decided to set the permissions to 777 and restart the shell. After doing that, I now get the same result as Ted Yu when running Seq(1,2).toDS.map(t => Test(t)).show. On Wed, Jun 1,

Spark 1.6 Driver Memory Issue

2016-06-01 Thread kali.tumm...@gmail.com
Hi All, I am getting a Spark driver memory issue even after overriding the conf by using --conf spark.driver.maxResultSize=20g, and I also mentioned it in my sql script (set spark.driver.maxResultSize =16;), but the same error still happens. Job aborted due to stage failure: Total size of

Symbolic links in Spark

2016-06-01 Thread Marco1982
Hi all, It seems to me that Spark Streaming doesn't read symbolic links. Can you confirm that? Marco

Re: ImportError: No module named numpy

2016-06-01 Thread Holden Karau
Generally this means numpy isn't installed on the system or your PYTHONPATH has somehow gotten pointed somewhere odd, On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra wrote: > If any one please can help me with following error. > > File >

ImportError: No module named numpy

2016-06-01 Thread Bhupendra Mishra
Could anyone please help me with the following error? File "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in ImportError: No module named numpy Thanks in advance!

Re: Map tuple to case class in Dataset

2016-06-01 Thread Tim Gautier
I spun up another EC2 cluster today with Spark 1.6.1 and I still get the error. scala> case class Test(a: Int) defined class Test scala> Seq(1,2).toDS.map(t => Test(t)).show 16/06/01 15:04:21 WARN scheduler.TaskSetManager: Lost task 39.0 in stage 0.0 (TID 39,

Best Practices for Spark Join

2016-06-01 Thread Aakash Basu
Hi, Can you please list, in order of importance, one by one, the best practices (necessary or better to follow) for doing a Spark join? Thanks, Aakash.

Re: Switching broadcast mechanism from torrent

2016-06-01 Thread Ted Yu
I found spark.broadcast.blockSize but no parameter to switch the broadcast method. Can you describe the issues with torrent broadcast in more detail? Which version of Spark are you using? Thanks On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > Our
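For what it's worth, Spark 1.x carried an undocumented factory setting that could select the legacy HTTP broadcast; treat this as a version-specific assumption to verify (it was removed in Spark 2.0):

    import org.apache.spark.SparkConf

    // undocumented in 1.x and removed in 2.0; use with caution
    val conf = new SparkConf()
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")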

Switching broadcast mechanism from torrent

2016-06-01 Thread Daniel Haviv
Hi, Our application is failing due to issues with the torrent broadcast. Is there a way to switch to another broadcast method? Thank you. Daniel

Re: Windows RStudio to Linux SparkR

2016-06-01 Thread Sun Rui
Selvam, first, deploy a Spark distribution on your Windows machine of the same version as the Spark in your Linux cluster. Second, follow the instructions at https://github.com/apache/spark/tree/master/R#using-sparkr-from-rstudio. Specify the Spark master URL for your Linux Spark

Re: Spark Job Execution halts during shuffle...

2016-06-01 Thread Priya Ch
Hi, can someone throw light on this? The issue is not happening frequently; sometimes the job halts with the above messages. Regards, Padma Ch On Fri, May 27, 2016 at 8:47 AM, Ted Yu wrote: > Priya: > Have you checked the executor logs on hostname1 and hostname2 ? > > Cheers

Ignore features in Random Forest

2016-06-01 Thread Neha Mehta
Hi, I am performing Regression using Random Forest. In my input vector, I want the algorithm to ignore certain columns/features while training the classifier and also while prediction. These are basically Id columns. I checked the documentation and could not find any information on the same.
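If no built-in ignore option exists, the usual workaround sketch is to leave the Id columns out of the assembled feature vector so the trainer never sees them (column names here are illustrative):

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3")) // Id columns are simply omitted
      .setOutputCol("features")
    val prepared = assembler.transform(df)   // df is the input DataFrame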

Re: hivecontext and date format

2016-06-01 Thread Mich Talebzadeh
Try this: SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate FROM ... HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

hivecontext and date format

2016-06-01 Thread pseudo oduesp
Hi, can I ask how we can convert a string like dd/MM/yyyy to date type in HiveContext? I tried with unix_timestamp and with date format but I get null. Thank you.

Shuffle Service - Connection Inactive - Creating new one

2016-06-01 Thread krishmah
I am seeing this in my logs: it appears to reopen the connection to the Shuffle Service. Whenever this happens, I see partitions take longer to complete. I was running a job with 1000 partitions. Up to about 600 partitions, it was completing a partition in less than 20 mins, and after that I

Windows RStudio to Linux SparkR

2016-06-01 Thread Selvam Raman
Hi, How do I connect to SparkR (which is available in a Linux env) using RStudio (Windows env)? Please help me. -- Selvam Raman "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"

Re: Spark Twitter Stream throws Null Pointer Exception

2016-06-01 Thread Mich Talebzadeh
Have you checked the YARN error logs, the resourcemanager and nodemanager logs? What do they say? It is possible that in cluster mode you have not set up the /tmp directories properly. Is anything else working in yarn-client mode? HTH Dr Mich Talebzadeh LinkedIn *

Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-06-01 Thread Alonso Isidoro Roman
Thank you David, I will try to follow your advice. Alonso Isidoro Roman about.me/alonso.isidoro.roman 2016-05-31 21:28 GMT+02:00 David Newberger

Spark Twitter Stream throws Null Pointer Exception

2016-06-01 Thread mayankshete
Hello Team, Can anyone tell why the below code throws a NullPointerException in yarn-client mode but runs fine in local mode? val filters = args.takeRight(0) val sparkConf = new SparkConf().setAppName("TwitterAnalyzer") val ssc = new StreamingContext(sparkConf,

Re: Accessing s3a files from Spark

2016-06-01 Thread Gourav Sengupta
Hi, I am sorry, I did read https://wiki.apache.org/hadoop/AmazonS3, which mentions s3:// being deprecated. From what I read, using s3a is the preferred way to go. Of course, I have been using it for writing data from Spark but not for reading yet. Let me try that and come back. Regards,
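A hedged sketch of reading via s3a (requires the hadoop-aws jar and the AWS SDK on the classpath; the property names are the standard Hadoop s3a keys, and the bucket and path are illustrative):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    val df = sqlContext.read.parquet("s3a://my-bucket/path/to/data")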