Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Andy Dang
Can't you just load the data from HBase first, and then call sc.parallelize on your dataset? -Andy --- Regards, Andy (Nam) Dang On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > When calling sc.parallelize(data,1), is there a preference

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
That's exactly what I am doing, but my question is: does parallelize send the data to a worker node? From a performance perspective on small sets, the ideal would be to load the data into the local JVM memory of the driver. I mean, even designating the current machine as a worker node, besides the driver, would

Metadata in Parquet

2015-09-30 Thread Philip Weaver
Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for the entire DataFrame, not attached to any specific field. Is this supported? - Philip
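
A minimal Scala sketch of the field-level metadata that is supported today (the DataFrame df and the column name are illustrative); DataFrame-wide metadata is the part that appears to be missing:

    import org.apache.spark.sql.types.MetadataBuilder

    val md = new MetadataBuilder().putString("source", "sensor-A").build()
    // Re-alias the column with metadata attached; the metadata is carried in the schema.
    val tagged = df.withColumn("reading", df("reading").as("reading", md))
    println(tagged.schema("reading").metadata.json)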

Re: Hive permanent functions are not available in Spark SQL

2015-09-30 Thread Pala M Muthaia
+user list On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia wrote: > Hi, > > I am trying to use internal UDFs that we have added as permanent functions > to Hive, from within Spark SQL query (using HiveContext), but i encounter > NoSuchObjectException, i.e. the

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Marcelo Vanzin
If you want to process the data locally, why do you need to use sc.parallelize? Store the data in regular Scala collections and use their methods to process them (they have pretty much the same set of methods as Spark RDDs). Then when you're happy, finally use Spark to process the pre-processed
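
A minimal sketch of that approach in Scala (the local loader is hypothetical): pre-process with ordinary collection methods on the driver, and hand the result to Spark only at the end.

    val raw: Seq[String] = loadSmallDatasetFromHBase()   // hypothetical local loader
    val cleaned = raw.map(_.trim).filter(_.nonEmpty)     // plain Scala collections, no cluster involved
    val rdd = sc.parallelize(cleaned, numSlices = 4)     // distribute only the pre-processed result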

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi, An equivalent question would be: can the memory cache be selectively evicted from within a component run in the driver? I know it is breaking some abstraction/encapsulation, but clearly I need to evict part of the cache so that it is reloaded with newer values from DB. Because what I
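
A sketch of selective eviction driven from the driver, assuming staleRdd holds the outdated values and loadFromDb() is a hypothetical reload of the affected data:

    staleRdd.unpersist(blocking = true)    // drop the old blocks from the block manager
    val refreshed = loadFromDb().cache()   // rebuild and cache with current DB values
    refreshed.count()                      // force materialization so the cache is repopulated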

Re: Metadata in Parquet

2015-09-30 Thread Cheng Lian
Unfortunately this isn't supported at the moment https://issues.apache.org/jira/browse/SPARK-10803 Cheng On 9/30/15 10:54 AM, Philip Weaver wrote: Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for

Fetching Date value from RDD of type spark.sql.row

2015-09-30 Thread satish chandra j
Hi All, I am currently using Spark 1.2.2. As the getDate method is not defined in *Public Class Row* for this Spark version, I am trying to fetch the Date value of a specific column using the *get* method as specified in the API docs, as mentioned below:
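
A sketch of that generic-accessor route on Spark 1.2.x (the RDD of rows, the column index, and the assumption that the column is a DateType backed by java.sql.Date are all illustrative):

    import java.sql.Date

    val dates = rowRdd.map(row => row(2).asInstanceOf[Date])   // positional access plus a cast
    // null-safe variant
    val safeDates = rowRdd.map(row => Option(row(2)).map(_.asInstanceOf[Date]))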

Re: Hive ORC Malformed while loading into spark data frame

2015-09-30 Thread Umesh Kacha
Hi Zang, thanks much. Please find the code below. Working code, loading data from a path created by a Hive table using the Hive console outside of Spark: DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/hive/table/partition") Not working: code inside Spark where the Hive tables were created using

Re: UnknownHostException with Mesos and custom Jar

2015-09-30 Thread Akhil Das
Can you try replacing your code with the hdfs uri? like: sc.textFile("hdfs://...").collect().foreach(println) Thanks Best Regards On Tue, Sep 29, 2015 at 1:45 AM, Stephen Hankinson wrote: > Hi, > > Wondering if anyone can help me with the issue I am having. > > I am

Re: Reading kafka stream and writing to hdfs

2015-09-30 Thread Akhil Das
Like: counts.saveAsTextFiles("hdfs://host:port/some/location") Thanks Best Regards On Tue, Sep 29, 2015 at 2:15 AM, Chengi Liu wrote: > Hi, > I am going thru this example here: > >

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Saisai Shao
As I remember, you don't need to upload the application jar manually; Spark will do it for you when you use spark-submit. Would you mind posting your spark-submit command? On Wed, Sep 30, 2015 at 3:13 PM, Christophe Schmitz wrote: > Hi there, > > I am trying to use the

Re: Setting Spark TMP Directory in Cluster Mode

2015-09-30 Thread mufy
Any takers? :-) --- Mufeed Usman My LinkedIn | My Social Cause | My Blogs : LiveJournal On Mon, Sep 28, 2015 at 10:19 AM, mufy wrote: > Hello Akhil,

Re: Spark thrift service and Hive impersonation.

2015-09-30 Thread Steve Loughran
On 30 Sep 2015, at 03:24, Mohammed Guller wrote: Does each user need to start their own thrift server to use it? No. One of the benefits of the Spark Thrift Server is that it allows multiple users to share a single SparkContext. Most likely,

Re: log4j Spark-worker performance problem

2015-09-30 Thread Akhil Das
Depends how big the lines are; on a typical HDD you can write at max 10-15 MB/s, and on SSDs it can be up to 30-40 MB/s. Thanks Best Regards On Mon, Sep 28, 2015 at 3:57 PM, vaibhavrtk wrote: > Hello > > We need a lot of logging for our application about 1000 lines needed to

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Akhil Das
Each JSON doc should be on a single line, I guess. http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Note that the file that is offered as *a json file* is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a
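
A Scala sketch of reading such a file, assuming the path is illustrative and every line holds one self-contained JSON object (pretty-printed multi-line JSON would need to be flattened first):

    val df = sqlContext.read.json("hdfs:///data/events.json")
    df.printSchema()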

Query about checkpointing time

2015-09-30 Thread jatinganhotra
Hi, I started doing the amp-camp 5 exercises. I tried the following 2 scenarios: *Scenario #1* val pagecounts = sc.textFile("data/pagecounts") pagecounts.checkpoint pagecounts.count *Scenario #2* val pagecounts =
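
For reference, a sketch of the checkpointing behaviour behind the question (the checkpoint directory is illustrative): checkpoint() only marks the RDD, and the data is written when the first action materializes it.

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // required before checkpointing
    val pagecounts = sc.textFile("data/pagecounts")
    pagecounts.checkpoint()   // only marks the RDD; nothing is written yet
    pagecounts.count()        // first action materializes the RDD and writes the checkpoint files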

Partition Column in JDBCRDD or Datasource API

2015-09-30 Thread satish chandra j
Hi All, Please provide your inputs on the partition column to be used in the DataSource API or JDBCRDD, in a scenario where the source table does not have a numeric column that is sequential and unique, such that proper partitioning can take place in Spark. Regards, Satish
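
One option in that situation, sketched in Scala (table name, column, date ranges and jdbcUrl are illustrative): pass explicit predicates, one per partition, instead of a numeric partition column.

    import java.util.Properties

    val predicates = Array(
      "created_date <  '2015-01-01'",
      "created_date >= '2015-01-01' AND created_date < '2015-07-01'",
      "created_date >= '2015-07-01'")
    val df = sqlContext.read.jdbc(jdbcUrl, "orders", predicates, new Properties())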

Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Christophe Schmitz
Hi there, I am trying to use the "--deploy-mode cluster" option to submit my job (spark 1.4.1). When I do that, the spark-driver (on the cluster) is looking for my application jar. I can manually copy my application jar on all the workers, but I was wondering if there is a way to submit the

Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread Camelia Elena Ciolac
Hello, I am working on a machine learning project, currently using spark-1.4.1-bin-hadoop2.6 in local mode on a laptop (Ubuntu 14.04 OS running on a Dell laptop with i7-5600@2.6 GHz * 4 cores, 15.6 GB RAM). I also mention working in Python from an IPython notebook. I face the following

Lost leader exception in Kafka Direct for Streaming

2015-09-30 Thread swetha
Hi, I see this sometimes in our Kafka Direct approach in our Streaming job. How do we make sure that the job recovers from such errors and works normally thereafter? 15/09/30 05:14:18 ERROR KafkaRDD: Lost leader for topic x_stream partition 19, sleeping for 200ms 15/09/30 05:14:18 ERROR

What is the best way to submit multiple tasks?

2015-09-30 Thread Saif.A.Ellafi
Hi all, I have a process where I do some calculations on each one of the columns of a dataframe. Intrinsically, I run across each column with a for loop. On the other hand, each process itself is not entirely distributable. To speed up the process, I would like to submit a spark program for
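
One way to overlap the per-column work, sketched in Scala (expensivePerColumnJob is a hypothetical routine that runs one Spark job per column): SparkContext is thread-safe, so independent jobs can be submitted concurrently from the same driver, for example via Futures.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val futures = df.columns.toSeq.map(col => Future { expensivePerColumnJob(df, col) })
    val results = Await.result(Future.sequence(futures), Duration.Inf)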

[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Alexey Ponkin
Hi, I have a simple spark-streaming job (8 executors with 1 core each, on an 8-node cluster) that reads from a Kafka topic (3 brokers with 8 partitions) and saves to Cassandra. The problem is that when I increase the number of incoming messages in the topic, the job starts to fail with

Combine key-value pair in spark java

2015-09-30 Thread Ramkumar V
Hi, I have a key-value pair JavaRDD (JavaPairRDD rdd) but I want to concatenate it into one RDD of String (JavaRDD result). How can I do that? What do I have to use (map, flatMap)? Can anyone please give me the syntax for this in Java? *Thanks*,

Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread camelia
Hello, I am working on a machine learning project, currently using spark-1.4.1-bin-hadoop2.6 in local mode on a laptop (Ubuntu 14.04 OS running on a Dell laptop with i7-5600@2.6 GHz * 4 cores, 15.6 GB RAM). I also mention working in Python from an IPython notebook. I face the following

Re: Combine key-value pair in spark java

2015-09-30 Thread Andy Dang
You should be able to use a simple mapping: rdd.map(tuple -> tuple._1() + tuple._2()) --- Regards, Andy (Nam) Dang On Wed, Sep 30, 2015 at 10:34 AM, Ramkumar V wrote: > Hi, > > I have key value pair of JavaRDD (JavaPairRDD rdd) but i > want to

Spark Streaming Standalone 1.5 - Stage cancelled because SparkContext was shut down

2015-09-30 Thread tranan
Hello All, I have several Spark Streaming applications running on Standalone mode in Spark 1.5. Spark is currently set up for dynamic resource allocation. The issue I am seeing is that I can have about 12 Spark Streaming Jobs running concurrently. Occasionally I would see more than half where

Re: Spark thrift service and Hive impersonation.

2015-09-30 Thread Vinay Shukla
Steve is right. The Spark thrift server does not propagate end-user identity downstream yet. On Wednesday, September 30, 2015, Steve Loughran wrote: > > On 30 Sep 2015, at 03:24, Mohammed Guller

Spark Streaming

2015-09-30 Thread Amith sha
Hi All, I am planning to handle streaming data from Kafka to Spark using Python code. Earlier, with my own log files, I handled them in Spark using an index. But in the case of Apache logs I cannot rely on an index, because by splitting on whitespace the index will be missed. So is it possible to use regex in

sc.parallelize with defaultParallelism=1

2015-09-30 Thread Nicolae Marasoiu
Hi, When calling sc.parallelize(data,1), is there a preference where to put the data? I see 2 possibilities: sending it to a worker node, or keeping it on the driver program. I would prefer to keep the data local to the driver. The use case is when I need just to load a bit of data from

Re: partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi, In fact, my RDD will get a new version (a new RDD assigned to the same var) quite frequently, by merging bulks of 1000 events from the last 10s. But recomputation would be more efficient to do not by reading the initial RDD partition(s) and reapplying deltas, but by reading from HBase the

Re: Combine key-value pair in spark java

2015-09-30 Thread Ramkumar V
Thanks, man. It works for me. *Thanks*, On Wed, Sep 30, 2015 at 4:31 PM, Andy Dang wrote: > You should be able to use a simple mapping: > > rdd.map(tuple -> tuple._1() + tuple._2()) > > --- > Regards, > Andy (Nam) Dang > > On

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Ted Yu
See the tail of this: https://bugzilla.redhat.com/show_bug.cgi?id=1005811 FYI > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg > wrote: > > Is there a way to ensure Spark doesn't write to /tmp directory? > > We've got spark.local.dir specified in the

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Thanks, Ted, will try it out. On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu wrote: > See the tail of this: > https://bugzilla.redhat.com/show_bug.cgi?id=1005811 > > FYI > > > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg > wrote: > > > > Is there a way

How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Is there a way to ensure Spark doesn't write to /tmp directory? We've got spark.local.dir specified in the spark-defaults.conf file to point at another directory. But we're seeing many of these snappy-unknown-***-libsnappyjava.so files being written to /tmp still. Is there a config setting or
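
A sketch of one workaround (paths illustrative, and assuming the bundled snappy-java honors its org.xerial.snappy.tempdir / java.io.tmpdir properties): the native library is extracted under the JVM temp dir, which spark.local.dir does not control, so redirect that as well in spark-defaults.conf.

    spark.driver.extraJavaOptions    -Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp
    spark.executor.extraJavaOptions  -Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp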

partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi, If I implement a manner to have an up-to-date version of my RDD by ingesting some new events, called RDD_inc (from increment), and I provide a "merge" function m(RDD, RDD_inc), which returns the RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs all the time,
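
A Scala sketch of that evolve-by-merge pattern (initialLoad, increment and batch are hypothetical): keep a var for the current state, merge each increment, and checkpoint periodically so the lineage does not grow without bound.

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")
    var state: RDD[(String, Long)] = initialLoad()
    def merge(cur: RDD[(String, Long)], inc: RDD[(String, Long)]): RDD[(String, Long)] =
      cur.union(inc).reduceByKey(_ + _)

    state = merge(state, increment).cache()
    if (batch % 10 == 0) { state.checkpoint(); state.count() }   // cut the lineage every N merges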

Problem understanding spark word count execution

2015-09-30 Thread Kartik Mathur
Hi All, I tried running the Spark word count and I have a couple of questions - I am analyzing stage 0, i.e. *sc.textFile -> flatMap -> Map (word count example)* 1) In the *Stage logs* under the Application UI details, for every task I am seeing Shuffle Write as 2.7 KB. *Question - how can I know where
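
For reference, a sketch of the job under discussion (paths illustrative): flatMap and map are narrow, so they stay inside stage 0, and the shuffle write reported for stage 0 is the map-side output produced for reduceByKey.

    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))      // narrow: stage 0
      .map(word => (word, 1))        // narrow: stage 0
      .reduceByKey(_ + _)            // wide: stage boundary, hence the shuffle write
    counts.saveAsTextFile("hdfs:///data/word-counts")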

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Christophe Schmitz
Hi Saisai, I am using this command: spark-submit --deploy-mode cluster --properties-file file.conf --class myclass test-assembly-1.0.jar The application starts only if I manually copy test-assembly-1.0.jar to all the workers (or the master, I don't remember) and provide the full path of the file.

Kafka Direct Stream

2015-09-30 Thread Udit Mehta
Hi, I am using the Spark direct stream to consume from multiple topics in Kafka. I am able to consume fine, but I am stuck at how to separate the data for each topic, since I need to process data differently depending on the topic. I basically want to split the RDD consisting of N topics into N RDDs
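
One common pattern, sketched in Scala (directStream and the topic name are illustrative): in the direct approach, partition i of each batch's RDD corresponds to offsetRanges(i), so the topic can be attached per partition and the stream then filtered per topic.

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val tagged = directStream.transform { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        iter.map { case (_, value) => (ranges(i).topic, value) }
      }
    }
    val topicA = tagged.filter { case (topic, _) => topic == "topicA" }   // one filter per topic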

Worker node timeout exception

2015-09-30 Thread markluk
I set up a new Spark cluster. My worker node is dying with the following exception. Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Michael Armbrust
I think the problem here is that you are passing in parsed JSON that is stored as a dictionary (which is converted to a hashmap when going into the JVM). You should instead be passing in the path to the JSON file (formatted as Akhil suggests) so that Spark can do the parsing in parallel. The other

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-09-30 Thread Saisai Shao
Are you running in standalone deploy mode? What Spark version are you running? Can you explain a little more specifically what exception occurs and how you provide the jar to Spark? I tried on my local machine with the command: ./bin/spark-submit --verbose --master spark://hw12100.local:7077

Re: Problem understanding spark word count execution

2015-09-30 Thread Nicolae Marasoiu
Hi, 2 - the end results are sent back to the driver; the shuffles are transmissions of intermediate results between nodes, across the "->" steps, which are all intermediate transformations. More precisely, since flatMap and map are narrow dependencies, meaning they can usually happen on the local node,

Re: [streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Cody Koeninger
Offset out of range means the message in question is no longer available on Kafka. What's your kafka log retention set to, and how does that compare to your processing time? On Wed, Sep 30, 2015 at 4:26 AM, Alexey Ponkin wrote: > Hi > > I have simple spark-streaming job(8

Re: RandomForestClassifer does not recognize number of classes, nor can number of classes be set

2015-09-30 Thread Yanbo Liang
Hi Kristina, Currently StringIndexer is a required step before training DecisionTree, RandomForest and GBT related models. Though it is not necessary for other models such as LogisticRegression and NaiveBayes, it is still strongly recommended to apply this preprocessing step, otherwise it may lead
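
A minimal sketch of that preprocessing in the ML pipeline API (column names and the training DataFrame are illustrative): indexing the label first attaches the class-count metadata that the tree-based learners rely on.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.StringIndexer

    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
    val model = new Pipeline().setStages(Array(labelIndexer, rf)).fit(trainingDF)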

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Ted Yu
bq. have tried these settings with the hbase protocol jar, to no avail In that case, HBaseZeroCopyByteString is contained in hbase-protocol.jar. In HBaseZeroCopyByteString , you can see: package com.google.protobuf; // This is a lie. If protobuf jar is loaded ahead of hbase-protocol.jar,

RE: Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread Ewan Leith
Try reducing the number of workers to 2, and increasing their memory up to 6GB. However, I've seen mention of a bug in the PySpark API when calling head() on a dataframe in Spark 1.5.0 and 1.4, which gives a big performance hit. https://issues.apache.org/jira/browse/SPARK-10731 It's fixed in

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Dmitry Goldenberg
I believe I've had trouble with --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true before, so these might not work... I was thinking of trying to add the solr4j jar to spark.executor.extraClassPath... On Wed, Sep 30, 2015 at 12:01 PM, Ted Yu
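
A sketch of the extraClassPath idea mentioned above (jar path illustrative): pass the jar on the driver and executor classpaths instead of relying on userClassPathFirst.

    spark-submit \
      --conf spark.driver.extraClassPath=/opt/jars/solr-solrj.jar \
      --conf spark.executor.extraClassPath=/opt/jars/solr-solrj.jar \
      ...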

Re: UnknownHostException with Mesos and custom Jar

2015-09-30 Thread Akhil Das
That's strange, for some reason your hadoop configurations are not picked up by spark. Thanks Best Regards On Wed, Sep 30, 2015 at 9:11 PM, Stephen Hankinson wrote: > When I use hdfs://affinio/tmp/Input it gives the same error about > UnknownHostException affinio. > >

New spark meetup

2015-09-30 Thread Yogesh Mahajan
Hi, Can you please get this new spark meetup listed on the spark community page - http://spark.apache.org/community.html#events Here is a link for the meetup in Pune, India : http://www.meetup.com/Pune-Apache-Spark-Meetup/ Thanks, Yogesh Sent from my iPhone

Re: unsubscribe

2015-09-30 Thread Richard Hillegas
Hi Sukesh, To unsubscribe from the dev list, please send a message to dev-unsubscr...@spark.apache.org. To unsubscribe from the user list, please send a message to user-unsubscr...@spark.apache.org. Please see: http://spark.apache.org/community.html#mailing-lists. Thanks, -Rick sukesh kumar