Exposing dataframe via thrift server

2016-03-30 Thread ram kumar
Hi, I started the thrift server: cd $SPARK_HOME; ./sbin/start-thriftserver.sh. Then the JDBC client: $ ./bin/beeline (Beeline version 1.5.2 by Apache Hive), beeline> !connect jdbc:hive2://ip:1 show tables; +------------+--------------+ | tableName | isTemporary | +------------+--------------+ |
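
A minimal sketch of one way to expose a DataFrame through the thrift server in Spark 1.x, assuming a HiveContext built in the same process; the input path and table name are placeholders:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.read.json("hdfs:///data/events.json")   // placeholder input path

    // Register the DataFrame as a temporary table so "show tables" over JDBC can see it,
    // then start a thrift server that shares this context (and therefore its temp tables).
    df.registerTempTable("events")
    HiveThriftServer2.startWithContext(hiveContext)

A thrift server started separately via ./sbin/start-thriftserver.sh runs its own session, so temporary tables registered from a different shell will not show up there.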

Unable to Run Spark Streaming Job in Hadoop YARN mode

2016-03-30 Thread Soni spark
Hi All, I am unable to run a Spark Streaming job in my Hadoop cluster; it is behaving unexpectedly. When I submit a job, it fails with a socket exception in HDFS; if I run the same job a second or third time, it runs for some time and then stops. I am confused. Is there any configuration in

Adding Recurrent Neural Network to Spark pipeline.

2016-03-30 Thread Thamali Wijewardhana
Hi all, I have created a program that uses recurrent neural networks for sentiment analysis. This program is built on the Deeplearning4j library and runs fine within a short time. Then I added the above program, built using the Deeplearning4j library, to a Spark pipeline and created a

Re: sqlContext.cacheTable + yarn client mode

2016-03-30 Thread Jeff Zhang
The table data is cached in the block managers on the executors. Could you paste the driver log showing the OOM? On Thu, Mar 31, 2016 at 1:24 PM, Soam Acharya wrote: > Hi folks, > > I understand that invoking sqlContext.cacheTable("tableName") will load > the table into a

sqlContext.cacheTable + yarn client mode

2016-03-30 Thread Soam Acharya
Hi folks, I understand that invoking sqlContext.cacheTable("tableName") will load the table into a compressed in-memory columnar format. When Spark is launched via spark shell in YARN client mode, is the table loaded into the local Spark driver process in addition to the executors in the Hadoop
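
A small sketch of the cacheTable lifecycle, assuming a table named "tableName" is already registered; the cached columnar blocks live in the executors' block managers, and the driver only holds metadata unless results are collected back:

    // Caching is lazy: the columnar blocks are built on the executors when an action runs.
    sqlContext.cacheTable("tableName")
    sqlContext.sql("SELECT COUNT(*) FROM tableName").show()

    // Inspect and release the cache when finished.
    println(sqlContext.isCached("tableName"))
    sqlContext.uncacheTable("tableName")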

SparkSQL Dataframe : partitionColumn, lowerBound, upperBound, numPartitions in context of reading from MySQL

2016-03-30 Thread Soumya Simanta
I'm trying to understand what the following configurations mean and their implication on reading data from a MySQL table. I'm looking for options that will impact my read throughput when reading data from a large table. Thanks. partitionColumn, lowerBound, upperBound, numPartitions These
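
As a rough sketch of how those four options fit together (connection details, column names, and bounds below are made up): Spark issues numPartitions parallel JDBC queries, each restricted to a slice of partitionColumn obtained by splitting [lowerBound, upperBound] into equal ranges; the bounds only control the split points, not which rows are read:

    val df = sqlContext.read.format("jdbc").options(Map(
      "url" -> "jdbc:mysql://dbhost:3306/mydb",   // placeholder connection details
      "dbtable" -> "big_table",
      "user" -> "user",
      "password" -> "secret",
      "partitionColumn" -> "id",     // numeric column used to split the reads
      "lowerBound" -> "1",           // only defines the split points
      "upperBound" -> "100000000",   // rows outside the bounds still land in the edge partitions
      "numPartitions" -> "32"        // number of parallel JDBC queries / resulting partitions
    )).load()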

How to design the input source of spark stream

2016-03-30 Thread kramer2...@126.com
Hi, my environment is as follows: 5 nodes, and each node generates a big CSV file every 5 minutes. I need Spark Streaming to analyze these 5 files every five minutes and generate a report. I am planning to do it this way: 1. Put those 5 files into an HDFS directory called /data 2. Merge
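
A minimal sketch of the usual file-based input for this kind of layout, assuming the files are moved atomically into an HDFS directory such as /data; textFileStream only picks up files that appear after the stream starts:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("csv-report")
    val ssc = new StreamingContext(conf, Minutes(5))     // one batch every five minutes

    // Each batch contains the lines of whatever new files landed in /data since the last one.
    val lines = ssc.textFileStream("hdfs:///data")
    lines.foreachRDD { rdd =>
      // parse the CSV lines and build the report here
      println(s"lines in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()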

Re: aggregateByKey on PairRDD

2016-03-30 Thread write2sivakumar@gmail
Hi, we can use combineByKey to achieve this. val finalRDD = tempRDD.combineByKey((x: (Any, Any)) => x, (acc: (Any, Any), x) => (acc, x), (acc1: (Any, Any), acc2: (Any, Any)) => (acc1, acc2)) finalRDD.collect.foreach(println) (amazon,((book1, tech),(book2,tech))) (barns, (book,tech)) (eBay,
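
For the same (brand, (product, key)) shape, a hedged aggregateByKey sketch that collects the values per brand into a list; the sample data and types are illustrative only:

    val tempRDD = sc.parallelize(Seq(
      ("amazon", ("book1", "tech")),
      ("amazon", ("book2", "tech")),
      ("barns",  ("book",  "tech"))
    ))

    // zeroValue = empty list; seqOp appends within a partition; combOp merges partitions.
    val grouped = tempRDD.aggregateByKey(List.empty[(String, String)])(
      (acc, v) => v :: acc,
      (acc1, acc2) => acc1 ++ acc2
    )
    grouped.collect().foreach(println)   // e.g. (amazon,List((book2,tech), (book1,tech)))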

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-30 Thread Russell Jurney
Actually, I can imagine a one or two line fix for this bug: call row.asDict() inside a wrapper for DataFrame.rdd. Probably deluding myself this could be so easily resolved? :) On Wed, Mar 30, 2016 at 6:10 PM, Russell Jurney wrote: > Thanks to some excellent work by

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-30 Thread Russell Jurney
Thanks to some excellent work by Luke Lovett, we have confirmed this is a bug. DataFrame.rdds are not the same as normal RDDs; they are serialized differently. It may just be unsupported functionality in PySpark. If that is the case, I think this should be added/fixed soon. The bug is here:

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
Hi Mich, I forgot to mention that - this is the ugly part - the source data provider gives us (Windows) pkzip-compressed files. Will Spark uncompress these automatically? I haven’t been able to make it work. Thanks, Ben > On Mar 30, 2016, at 2:27 PM, Mich Talebzadeh

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Mich Talebzadeh
Hi Ben, well, I have done it for standard CSV files downloaded from spreadsheets to a staging directory on HDFS and loaded from there. First, you may not need to unzip them; Databricks can read them (in my case) and zipped files. Check this. Mine is slightly different from what you have. First I

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
Hi Mich, You are correct. I am talking about the Databricks package spark-csv you have below. The files are stored in s3 and I download, unzip, and store each one of them in a variable as a string using the AWS SDK (aws-java-sdk-1.10.60.jar). Here is some of the code. val filesRdd =

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Mich Talebzadeh
Just to clarify, are you talking about the Databricks CSV package? $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 Where are these zipped files? Are they copied to a staging directory in HDFS? HTH Dr Mich Talebzadeh LinkedIn *

Re: pyspark read json file with high dimensional sparse data

2016-03-30 Thread Michael Armbrust
You can force the data to be loaded as a sparse map assuming the key/value types are consistent. Here is an example . On Wed, Mar 30,
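
One way to force a map-typed (sparse) representation, sketched here rather than taken from the example linked above, is to supply an explicit schema with a MapType when reading the JSON; the field name and path are placeholders:

    import org.apache.spark.sql.types.{DoubleType, MapType, StringType, StructField, StructType}

    // Assumes each JSON record looks like {"features": {"dim_17": 0.3, "dim_204": 1.0}}.
    val schema = StructType(Seq(
      StructField("features", MapType(StringType, DoubleType), nullable = true)
    ))

    // With an explicit schema Spark skips schema inference and keeps only the keys present
    // in each record, instead of widening every record to thousands of mostly-null columns.
    val df = sqlContext.read.schema(schema).json("hdfs:///data/sparse.json")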

Spark 1.5.2 Master OOM

2016-03-30 Thread Yong Zhang
Hi Sparkers, our cluster is running Spark 1.5.2 in standalone mode. It runs fine for weeks, but today I found that the master crashed due to OOM. We have several ETL jobs that run daily on Spark, plus ad-hoc jobs. I can see the "Completed Applications" table grow in the master UI. Originally I set

pyspark read json file with high dimensional sparse data

2016-03-30 Thread Yavuz Nuzumlalı
Hi all, I'm trying to read data inside a JSON file using the `SQLContext.read.json()` method; however, the reading operation does not finish. My data is of 29x3100 dimensions but is actually really sparse, so if there is a way to read the JSON directly into a sparse dataframe, it would work perfect

Re: Unable to set cores while submitting Spark job

2016-03-30 Thread Mich Talebzadeh
Hi Ted, can one specify the cores as follows, for example 12 cores? val conf = new SparkConf(). setAppName("ImportStat"). setMaster("local[12]"). set("spark.driver.allowMultipleContexts", "true"). set("spark.hadoop.validateOutputSpecs", "false") val sc = new

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Michael Segel
It sounds like when you start up Spark it's using 0.0.0.0, which means it will listen on all interfaces. You should be able to limit which interface to use. The weird thing is that if you are specifying the IP address and port, Spark shouldn’t be listening on all of the interfaces for that

Re: Plot DataFrame with matplotlib

2016-03-30 Thread Yavuz Nuzumlalı
Hi Teng, thanks for the answer. I've switched to pandas during the proof-of-concept process in order to be able to plot graphs easily. Actually, the pandas DataFrame object itself has `plot` methods, so these objects can plot themselves easily in most cases (it uses matplotlib inside). I wonder if

Re: Loading multiple packages while starting spark-shell

2016-03-30 Thread Mustafa Elbehery
Hi, This worked out .. Thanks a lot :) On Wed, Mar 30, 2016 at 4:37 PM Ted Yu wrote: > How did you specify the packages ? > > See the following from > https://spark.apache.org/docs/latest/submitting-applications.html : > > Users may also include any other dependencies by

Re: Loading multiple packages while starting spark-shell

2016-03-30 Thread Ted Yu
How did you specify the packages ? See the following from https://spark.apache.org/docs/latest/submitting-applications.html : Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. On Wed, Mar 30, 2016 at 7:15 AM, Mustafa Elbehery

Re: Checkpoint of DStream joined with RDD

2016-03-30 Thread Lubomir Nerad
Hi Ted, all, do you have any advice regarding my questions in my initial email? I tried Spark 1.5.2 and 1.6.0 without success. The problem seems to be that RDDs use some transient fields which are not restored when they are recovered from checkpoint files. In case of some RDD implementations

Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
I have a quick question. I have downloaded multiple zipped files from S3 and unzipped each one of them into strings. The next step is to parse them using a CSV parser. I want to know if there is a way to easily use the spark-csv package for this. Thanks, Ben
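
If the data is already in memory as strings, one possible route, sketched under the assumption that the CsvParser.csvRdd entry point is available in the spark-csv version in use (worth verifying), is to turn the strings into an RDD[String] and hand that to the parser:

    import com.databricks.spark.csv.CsvParser

    // csvStrings: Seq[String], each element being the full text of one unzipped file (assumed).
    val lines = sc.parallelize(csvStrings).flatMap(_.split("\n"))
    // Note: with several files, every file's header line ends up in the RDD and may need filtering.

    val df = new CsvParser()
      .withUseHeader(true)
      .csvRdd(sqlContext, lines)   // assumed entry point; verify against your spark-csv version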

Loading multiple packages while starting spark-shell

2016-03-30 Thread Mustafa Elbehery
Hi Folks, I am trying to use two Spark packages while working from the shell. Unfortunately it accepts only one package as a parameter and ignores the second. Any suggestions on how to work around this? Regards.

Re: spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-03-30 Thread Ted Yu
Have you tried the following construct ? new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey() See core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala On Wed, Mar 30, 2016 at 5:20 AM, Nirav Patel wrote: > Hi, I am trying to use filterByRange feature of

Re: Unable to set cores while submitting Spark job

2016-03-30 Thread Ted Yu
-c CORES, --cores CORES: total CPU cores to allow Spark applications to use on the machine (default: all available); only on the worker. Regarding sc.getConf().set(): I think you should use this pattern (shown in https://spark.apache.org/docs/latest/spark-standalone.html): val conf = new SparkConf()
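
A sketch of the pattern being pointed at: the core limits must be on the SparkConf before the SparkContext is created (calling sc.getConf().set() afterwards has no effect on a running context), and on standalone the application-side cap is spark.cores.max:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyJob")
      .set("spark.cores.max", "1")        // total cores across the cluster for this application
      .set("spark.executor.cores", "1")   // cores per executor
    val sc = new SparkContext(conf)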

Re: Configuring log4j Spark

2016-03-30 Thread Guillermo Ortiz
I changed the place of --files and it works. (IT DOESN'T WORK) spark-submit --conf spark.metrics.conf=metrics.properties --name "myProject" --master yarn-cluster --class myCompany.spark.MyClass --files /opt/myProject/conf/log4j.properties --jars $SPARK_CLASSPATH --executor-memory 1024m

Configuring log4j Spark

2016-03-30 Thread Guillermo Ortiz
I'm trying to configure log4j in Spark. spark-submit --conf spark.metrics.conf=metrics.properties --name "myProject" --master yarn-cluster --class myCompany.spark.MyClass --files /opt/myProject/conf/log4j.properties --jars $SPARK_CLASSPATH --executor-memory 1024m --num-executors 5

Re: Cached Parquet file paths problem

2016-03-30 Thread psmolinski
bumping up the topic. For the moment I stay with 1.5.2, but I would like to switch to 1.6.x and this issue is a blocker. Thanks, Piotr -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cached-Parquet-file-paths-problem-tp26576p26637.html Sent from the

Unable to set cores while submitting Spark job

2016-03-30 Thread vetal king
Hi all, while submitting a Spark job I am specifying the options --executor-cores 1 and --driver-cores 1. However, when the job was submitted, it used all available cores. So I tried to limit the cores within my main function: sc.getConf().set("spark.cores.max", "1"); however, it still

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread David O'Gwynn
Thanks much, Akhil. iptables is certainly a bandaid, but from an OpSec perspective, it's troubling. Is there any way to limit which interfaces the WebUI listens on? Is there a Jetty configuration that I'm missing? Thanks again for your help, David On Wed, Mar 30, 2016 at 2:25 AM, Akhil Das

spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-03-30 Thread Nirav Patel
Hi, I am trying to use the filterByRange feature of Spark's OrderedRDDFunctions in the hope that it will speed up filtering by scanning only the required partitions. I have created a paired RDD with a RangePartitioner in one Scala class, and in another class I am trying to access this RDD and do the following: In
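
A sketch of what typically makes filterByRange resolve or not: it comes from OrderedRDDFunctions via an implicit conversion that needs an Ordering for the key type in scope, so a custom key class without one produces exactly this "value filterByRange is not a member" error; the key class below is a placeholder:

    import org.apache.spark.RangePartitioner

    case class MyKey(id: Int)
    // Without an Ordering[MyKey] in scope, pairRdd.filterByRange(...) will not compile.
    implicit val myKeyOrdering: Ordering[MyKey] = Ordering.by(_.id)

    val pairRdd = sc.parallelize(Seq(MyKey(1) -> "a", MyKey(5) -> "b", MyKey(9) -> "c"))
    val partitioned = pairRdd.partitionBy(new RangePartitioner(4, pairRdd)).cache()

    // Scans only the partitions whose key range overlaps [MyKey(2), MyKey(8)].
    val subset = partitioned.filterByRange(MyKey(2), MyKey(8))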

Trouble facing with Timestamp column in SparkR when loading CSV file from S3

2016-03-30 Thread ps30
I am loading a CSV file from an S3 bucket into a Spark DataFrame using the read.df() function, in RStudio running on an EC2 cluster. One of the columns is a Timestamp column, which is being loaded as String. When I try to convert the data type to Timestamp using the cast function, all

Re: Spark Streaming UI duration numbers mismatch

2016-03-30 Thread Jatin Kumar
Hello Jean, were you able to reproduce the problem? I couldn't find any documentation on whether the two numbers have different meanings. -- Thanks Jatin On Thu, Mar 24, 2016 at 1:43 AM, Jean-Baptiste Onofré wrote: > Hi Jatin, > > I will reproduce tomorrow and take a look. > > Did

Re: aggregateByKey on PairRDD

2016-03-30 Thread Daniel Haviv
Hi, shouldn't groupByKey be avoided ( https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)? Thank you, Daniel On Wed, Mar 30, 2016 at 9:01 AM, Akhil Das wrote: > Isn't it what

Checkpoints in Spark

2016-03-30 Thread Guillermo Ortiz
I'm curious about what kinds of things are saved in the checkpoints. I just changed the number of executors when executing Spark, and it didn't take effect until I removed the checkpoint; I guess that if I'm using log4j.properties and I want to change it, I have to remove the checkpoint as well. When you
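
A sketch of why that happens with streaming checkpoints, assuming the standard getOrCreate recovery pattern: the SparkConf and the DStream graph are serialized into the checkpoint, and the create function is skipped on recovery, so settings changed afterwards may be ignored until the checkpoint directory is cleared:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/myApp"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("myApp")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // build the DStream graph here
      ssc
    }

    // On restart the context is rebuilt from the checkpoint and createContext() is skipped,
    // which is why new executor counts or logging settings may not take effect.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()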

Re: Unit testing framework for Spark Jobs?

2016-03-30 Thread Lars Albertsson
Thanks! It is on my backlog to write a couple of blog posts on the topic, and eventually some example code, but I am currently busy with clients. Thanks for the pointer to Eventually - I was unaware. Fast exit on exception would be a useful addition, indeed. Lars Albertsson Data engineering
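
For reference, a tiny sketch of ScalaTest's Eventually mentioned above, which retries an assertion until it passes or a timeout expires (handy for asynchronous streaming tests); the assertion body is a placeholder:

    import org.scalatest.concurrent.Eventually._
    import org.scalatest.time.{Seconds, Span}

    // Retries the block until it stops throwing, or fails after 30 seconds.
    eventually(timeout(Span(30, Seconds)), interval(Span(1, Seconds))) {
      assert(outputCollector.size == expectedCount)   // placeholder assertion
    }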

Re: Running Spark on Yarn

2016-03-30 Thread Vineet Mishra
RM NM logs traced below, RM --> 2016-03-30 14:59:15,498 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1459326455972_0004_01_01, NodeId: myhost:60653, NodeHttpAddress: myhost:8042, Resource:

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Steve Loughran
On 30 Mar 2016, at 04:44, Selvam Raman wrote: Hi, I am using Spark 1.6.0 prebuilt for Hadoop 2.6.0 on my Windows machine. I was trying to use the Databricks CSV format to read a CSV file. I used the below command and got a null pointer exception. Any

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-30 Thread Eugene Morozov
One more thing: with the increased stack size it has completed twice more already, but now I see this in the log. [dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage 24860 contains a task of very large size (157 KB). The maximum recommended task size is 100 KB. Size of the task
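
When a large driver-side object is being pulled into the task closure, which is one common cause of that warning, broadcasting it is the usual remedy; a small sketch with made-up names:

    // bigModel is assumed to be a large object built on the driver; data is a placeholder RDD[String].
    val bigModel: Map[String, Double] = buildModel()   // placeholder
    val bigModelBc = sc.broadcast(bigModel)

    // Referencing the broadcast handle keeps the serialized task small;
    // each executor fetches the value once instead of once per task.
    val scored = data.map(row => bigModelBc.value.getOrElse(row, 0.0))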

Re: Sending events to Kafka from spark job

2016-03-30 Thread أنس الليثي
Dear Andy, as far as I understand, the transformations are applied to the RDDs, not to the data, and I need to send the actual data to Kafka. This way, I think I should perform at least one action to make Spark load the data. Kindly correct me if I do not understand this correctly. Best
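
A sketch of the usual pattern for pushing data to Kafka from a streaming job (broker address, topic, and the DStream name are placeholders): foreachRDD supplies the action, and the producer is created per partition so it is not shipped in the closure:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // messages is assumed to be a DStream[String].
    messages.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(msg => producer.send(new ProducerRecord[String, String]("events", msg)))
        producer.close()
      }
    }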

Re: Spark streaming spilling all the data to disk even if memory available

2016-03-30 Thread Akhil Das
Can you elaborate more on where you are streaming the data from and what type of consumer you are using, etc.? Thanks Best Regards On Tue, Mar 29, 2016 at 6:10 PM, Mayur Mohite wrote: > Hi, > > We are running spark streaming app on a single machine and we have >

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Akhil Das
In your case, you will be able to see the webui (unless restricted with iptables) but you won't be able to submit jobs to that machine from a remote machine since the spark master is spark://127.0.0.1:7077 Thanks Best Regards On Tue, Mar 29, 2016 at 8:12 PM, David O'Gwynn

Re: Master options Cluster/Client descrepencies.

2016-03-30 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 Thanks Best Regards On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > > Hi All, > > I have written a spark program on my dev box , >IDE:Intellij >

Re: aggregateByKey on PairRDD

2016-03-30 Thread Akhil Das
Isn't it what tempRDD.groupByKey does? Thanks Best Regards On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh wrote: > Hi All, > > I have an RDD having the data in the following form : > > tempRDD: RDD[(String, (String, String))] > > (brand , (product, key)) > >

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Akhil Das
Looks like the winutils.exe is missing from the environment, See https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman wrote: > Hi, > > i am using spark 1.6.0 prebuilt hadoop 2.6.0 version in my windows machine. >

Re: data frame problem preserving sort order with repartition() and coalesce()

2016-03-30 Thread Takeshi Yamamuro
Hi, "csvDF = csvDF.sort(orderByColName, ascending=False)" repartitions DF by using RangePartitioner (#partitions depends on "spark.sql.shuffle.partitions"). Seems, in your case, some empty partitions were removed, then you got 17 paritions. // maropu On Wed, Mar 30, 2016 at 6:49 AM, Andy

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Selvam Raman
Hi, I am able to load and extract the data; the only problem is when I use this Databricks library. Thanks, Selvam R On Wed, Mar 30, 2016 at 9:33 AM, Hyukjin Kwon wrote: > Hi, > > I guess this is not a CSV-datasource specific problem. > > Does loading any file (eg.