Re: Create RDD from output of unix command

2015-07-18 Thread Gylfi
You may want to look into using the pipe command: http://blog.madhukaraphatak.com/pipe-in-spark/ and http://spark.apache.org/docs/0.6.0/api/core/spark/rdd/PipedRDD.html
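For reference, a minimal sketch of RDD.pipe (the paths and the grep command are illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipeExample"))
    // Each partition's elements are fed to the command's stdin, one per line;
    // every line the command prints becomes an element of the resulting RDD.
    val lines  = sc.textFile("hdfs:///logs/input.txt")   // illustrative path
    val errors = lines.pipe("grep -i error")             // any unix command works
    errors.saveAsTextFile("hdfs:///logs/errors")         // illustrative path
    sc.stop()
  }
}
```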

Re: Passing Broadcast variable as parameter

2015-07-18 Thread Gylfi
Hi. You can use a broadcast variable to make data available to all the nodes in your cluster, and it can live longer than just the current distributed task. For example, if you need to access a large structure in multiple sub-tasks, instead of sending that structure again and again with each task, you can broadcast it once.
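A minimal spark-shell sketch of the idea (sc is predefined there; the lookup table is made-up stand-in data):

```scala
// Ship a large read-only structure to every executor once; it is cached
// there and reused by every task instead of being re-sent with each closure.
val lookup  = Map(1 -> "one", 2 -> "two", 3 -> "three")  // stand-in for a big table
val lookupB = sc.broadcast(lookup)

val ids    = sc.parallelize(Seq(1, 2, 3, 2, 1))
val labels = ids.map(id => lookupB.value.getOrElse(id, "?"))  // read via .value
labels.collect().foreach(println)
```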

Re: write a HashMap to HDFS in Spark

2015-07-18 Thread Gylfi
Hi. Assuming you have the data in an RDD, you can save your RDD (regardless of structure) with nameRDD.saveAsObjectFile(path), where path can be hdfs:///myfolderonHDFS or a location on the local file system. Alternatively, you can also use .saveAsTextFile(). Regards, Gylfi.
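A spark-shell sketch of both variants (paths are illustrative):

```scala
// Turn the HashMap into an RDD of pairs, then persist it.
val map = scala.collection.mutable.HashMap("a" -> 1, "b" -> 2)
val rdd = sc.parallelize(map.toSeq)

rdd.saveAsObjectFile("hdfs:///user/me/mymap.obj")  // binary, Java-serialized
rdd.saveAsTextFile("hdfs:///user/me/mymap.txt")    // human-readable alternative

// Reading the object file back requires the element type:
val restored = sc.objectFile[(String, Int)]("hdfs:///user/me/mymap.obj")
restored.collect().foreach(println)
```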

Re: Spark same execution time on 1 node and 5 nodes

2015-07-18 Thread Gylfi
Hi. If I just look at the two pics, I see that there is only one sub-task that takes all the time: the flatMapToPair at Coef... line 52. I also see that there are only two partitions making up the input, and thus probably only two workers active. Try repartitioning the data into more partitions.
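For example (spark-shell sketch; 40 is an illustrative target, a common rule of thumb being 2-4 partitions per core in the cluster):

```scala
val input = sc.textFile("hdfs:///data/input")  // illustrative path
input.partitions.length                        // e.g. 2 -> only two tasks per stage

val balanced = input.repartition(40)           // shuffles into 40 partitions
balanced.partitions.length                     // 40 -> work for every executor
```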

PicklingError: Could not pickle object as excessively deep recursion required.

2015-07-18 Thread Andrej Burja
Hi. On Windows, in local mode, using pyspark, I got an error about excessively deep recursion. I'm using a module for lemmatizing/stemming that relies on a DLL and some binary files (the module is a Python wrapper around C code). Spark version is 1.4.0. Any idea what is going on?

Re: Spark APIs memory usage?

2015-07-18 Thread Harit Vishwakarma
Even if I remove the numpy calls (no matrices loaded), the same exception occurs. Can anyone tell me what createDataFrame does internally? Are there any alternatives to it? On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I suspect it's the numpy filling up memory. Thanks

Re: Flatten list

2015-07-18 Thread Gylfi
Hi. To be honest, I don't really understand your problem statement :( but let's just talk about how .flatMap() works. Unlike .map(), which only allows a one-to-one transformation, .flatMap() allows 0, 1, or many outputs per item processed, but the output must take the form of a sequence of the same type.
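A spark-shell sketch of the difference (the sample data is made up):

```scala
val lines = sc.parallelize(Seq("a b", "c d e"))

lines.map(_.split(" "))      // RDD[Array[String]] -- one array per line (nested)
lines.flatMap(_.split(" "))  // RDD[String]        -- a, b, c, d, e (flattened)

// Returning an empty sequence yields zero outputs, which is how flatMap
// can also drop items entirely:
lines.flatMap(l => if (l.startsWith("a")) Seq.empty[String] else l.split(" ").toSeq)
```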

Re: Spark and SQL Server

2015-07-18 Thread Davies Liu
I think you have a mistake in your call to jdbc(); it should be: jdbc(self, url, table, mode, properties). You passed properties as the third parameter. On Fri, Jul 17, 2015 at 10:15 AM, Young, Matthew T matthew.t.yo...@intel.com wrote: Hello, I am testing Spark interoperation with SQL Server via

Re: Using reference for RDD is safe?

2015-07-18 Thread Gylfi
Hi. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
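A spark-shell sketch of the behavior (the path is illustrative):

```scala
val data = sc.textFile("hdfs:///data/app.log")  // transformation: nothing runs yet
val errs = data.filter(_.contains("ERROR"))     // transformation: still nothing
errs.count()                                    // action: the whole chain executes

// Each later action recomputes the chain from the source unless you cache,
// so if the underlying file changed in between, the results can differ:
errs.cache()
errs.count()  // computed once more here, then served from memory afterwards
```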

Re: No. of Task vs No. of Executors

2015-07-18 Thread Gylfi
You could even try changing the block size of the input data on HDFS (this can be done on a per-file basis), and that would get all the workers going right from the get-go in Spark.
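A related knob, not the block-size change Gylfi describes but with a similar effect on initial parallelism, is the minPartitions argument when reading the file (spark-shell sketch, illustrative values):

```scala
// Ask Spark for more input splits than the file's HDFS block count, so all
// workers get tasks from the first stage onward.
val data = sc.textFile("hdfs:///data/big.csv", 64)  // minPartitions = 64
data.partitions.length                              // usually >= 64
```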

Re: K Nearest Neighbours

2015-07-18 Thread Gylfi
Hi. What I would do in your case would be something like this. Let's call the two datasets qs and ds, where qs is an array of vectors and ds is an RDD[(dsID: Long, Vector)]. Do the following: 1) Create a k-NN class that can keep track of the k nearest neighbors so far. It must have a qsID
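For a single query vector, the top-k idea can be sketched like this in spark-shell (the names, sample data, and squared-distance helper are illustrative; Gylfi's full proposal also tracks a qsID per neighbor list):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def sqDist(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum

val q  = Vectors.dense(1.0, 2.0)             // one query vector
val ds = sc.parallelize(Seq(                 // stand-in dataset
  (1L, Vectors.dense(0.0, 0.0)),
  (2L, Vectors.dense(1.0, 1.5)),
  (3L, Vectors.dense(9.0, 9.0))))

val k = 2
// takeOrdered keeps only the k best per partition and merges on the driver,
// which is exactly the "k nearest so far" bookkeeping, done for you.
val knn = ds.map { case (id, v) => (sqDist(q, v), id) }.takeOrdered(k)
```

For the full qs array, one would broadcast qs and do the same per-partition bookkeeping, e.g. with mapPartitions or aggregate.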

Using Dataframe write with newHdoopApi

2015-07-18 Thread ayan guha
Hi, I am trying to use a DF and save it to Elasticsearch using the newHadoopApi (because I am using Python). Can anyone guide me on whether this is even possible? -- Best Regards, Ayan Guha

Re: DataFrame more efficient than RDD?

2015-07-18 Thread Ted Yu
Here is a related thread: http://search-hadoop.com/m/q3RTtPmjSJ1Dod92 On Jul 15, 2015, at 7:41 AM, k0ala k0ala.k0...@gmail.com wrote: Hi, I have been working a bit with RDDs, and am now taking a look at DataFrames. The schema definition using case classes looks very attractive;

BigQuery connector for pyspark via Hadoop Input Format example

2015-07-18 Thread lfiaschi
I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input/Output format (https://cloud.google.com/hadoop/writing-with-bigquery-connector) and pyspark should be able to use this

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-07-18 Thread Naveen Madhire
I am facing the same issue. I tried this but got a compilation error for the $ in the explode function, so I had to modify it to the form below to make it work: df.select(explode(new Column("entities.user_mentions")).as("mention")) On Wed, Jun 24, 2015 at 2:48 PM, Michael Armbrust
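For comparison, both forms side by side ($-syntax requires the implicits import, which is the usual cause of that compilation error; df is assumed to be a DataFrame of tweets with an entities.user_mentions array column):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.explode

// In spark-shell, sqlContext is predefined; this import enables $"..."
import sqlContext.implicits._

val m1 = df.select(explode($"entities.user_mentions").as("mention"))            // $ syntax
val m2 = df.select(explode(new Column("entities.user_mentions")).as("mention")) // no implicits needed
```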

Spark-hive parquet schema evolution

2015-07-18 Thread Jerrick Hoang
Hi all, I'm aware of the support for schema evolution via the DataFrame API. I'm just wondering what would be the best way to go about dealing with schema evolution for Hive metastore tables. Say I create a table via the SparkSQL CLI; how would I deal with Parquet schema evolution? Thanks, J

Re: No. of Task vs No. of Executors

2015-07-18 Thread David Mitchell
This is likely due to data skew. If you are using key-value pairs, one key has a lot more records than the other keys. Do you have any groupBy operations? David On Tue, Jul 14, 2015 at 9:43 AM, shahid sha...@trialx.com wrote: Hi, I have a 10-node cluster; I loaded the data onto HDFS, so
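A quick spark-shell check for skew (the sample data stands in for the real pair RDD):

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))
val perKey = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

// The heaviest keys; one key dwarfing the rest confirms the skew.
perKey.top(10)(Ordering.by(_._2)).foreach(println)
```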

Re: spark-shell with Yarn failed

2015-07-18 Thread Chester @work
It might be a network issue; the error states that it failed to bind the server IP address. Chester On Jul 18, 2015, at 11:46 AM, Amjad ALSHABANI ashshab...@gmail.com wrote: Does anybody have any idea about the error I'm having? I am really clueless, and appreciate any idea

RE: Feature Generation On Spark

2015-07-18 Thread Mohammed Guller
Try this (replace ... with the appropriate values for your environment): import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import org.apache.spark.mllib.feature.HashingTF import org.apache.spark.mllib.linalg.Vector val sc = new SparkContext(...) val documents =
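A complete sketch of where that snippet is presumably heading (the path, master, and whitespace tokenization are assumptions):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local[*]", "FeatureGeneration")  // illustrative master/name

// One document per line, tokenized on whitespace.
val documents: RDD[Seq[String]] =
  sc.textFile("hdfs:///data/docs.txt").map(_.split("\\s+").toSeq)

// Hash each term into a fixed-size term-frequency vector.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
```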

Re: spark-shell with Yarn failed

2015-07-18 Thread Amjad ALSHABANI
Does anybody have any idea about the error I'm having? I am really clueless, and would appreciate any idea :) Thanks in advance, Amjad On Jul 17, 2015 5:37 PM, Amjad ALSHABANI ashshab...@gmail.com wrote: Hello, First of all, I'm a newbie in Spark. I'm trying to start the spark-shell with YARN

Spark1.4 application throw java.lang.NoClassDefFoundError: javax/servlet/FilterRegistration

2015-07-18 Thread Wwh 吴
Hi, I have built a Spark application with IDEA. When I run SparkPi, IDEA throws an exception like this: Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/FilterRegistration at org.spark-project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136) at
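A commonly reported cause of this particular error is a servlet-api version conflict on the classpath (FilterRegistration only exists in Servlet 3.x). A build.sbt sketch of one possible fix, assuming an sbt project whose other dependencies pulled in an older servlet-api 2.5:

```scala
// build.sbt (sketch): make sure a 3.x servlet API wins over any servlet-api
// 2.5 that other dependencies (e.g. Hadoop's) may have dragged in.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"        % "1.4.0",
  "javax.servlet"    %  "javax.servlet-api" % "3.0.1"
)
```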

How to restart Twitter spark stream

2015-07-18 Thread Zoran Jeremic
Hi, I have a Twitter Spark stream initialized in the following way: val ssc: StreamingContext = SparkLauncher.getSparkScalaStreamingContext() val config = getTwitterConfigurationBuilder.build() val auth: Option[twitter4j.auth.Authorization] = Some(new
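A StreamingContext cannot be started again once stopped, so "restarting" generally means tearing down the streaming context (while keeping the SparkContext) and building a new one. A spark-shell sketch under that assumption (credentials taken from twitter4j system properties; filters are illustrative):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

def startTwitterStream(filters: Seq[String]): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))       // reuse the SparkContext
  val auth: Option[twitter4j.auth.Authorization] = None // None -> twitter4j.properties
  TwitterUtils.createStream(ssc, auth, filters).map(_.getText).print()
  ssc.start()
  ssc
}

var ssc = startTwitterStream(Seq("spark"))
// To "restart" with new filters: stop streaming but keep the SparkContext...
ssc.stop(stopSparkContext = false, stopGracefully = true)
// ...then create and start a brand-new streaming context.
ssc = startTwitterStream(Seq("hadoop"))
```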

[General Question] [Hadoop + Spark at scale] Spark Rack Awareness ?

2015-07-18 Thread Mike Frampton
I wanted to ask a general question about Hadoop/YARN and Apache Spark integration. I know that Hadoop on a physical cluster has rack awareness, i.e. it attempts to minimise network traffic by saving replicated blocks within a rack. I wondered whether, when Spark is configured to use