Re: Broadcast variables in R

2015-07-21 Thread Shivaram Venkataraman
There shouldn't be anything Mac OS specific about this feature. One point of warning though -- As mentioned previously in this thread the APIs were made private because we aren't sure we will be supporting them in the future. If you are using these APIs it would be good to chime in on the JIRA

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew Or, Yes, NodeManager was restarted, I also checked the logs to see if the JARs appear in the CLASSPATH. I have also downloaded the binary distribution and use the JAR spark-1.4.1-bin-hadoop2.4/lib/spark-1.4.1-yarn-shuffle.jar without success. Has anyone successfully enabled the

RE: Would driver shutdown cause app dead?

2015-07-21 Thread Young, Matthew T
ZhuGe, If you run your program in the cluster deploy-mode you get resiliency against driver failure, though there are some steps you have to take in how you write your streaming job to allow for transparent resume. Netflix did a nice writeup of this resiliency

Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Dean Wampler
TD's Spark Summit talk offers suggestions ( https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/). He recommends using HDFS, because you get the triplicate resiliency it offers, albeit with extra overhead. I believe the driver doesn't need visibility

Re: Spark MLlib instead of Mahout - collaborative filtering model

2015-07-21 Thread Anas Sherwani
I have never used Mahout, so cannot compare the two. Spark MLlib, however, provides matrix factorization based Collaborative Filtering http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html using Alternating Least Squares algorithm. Also, Singular Value Decomposition is handled

Re: user threads in executors

2015-07-21 Thread Richard Marscher
You can certainly create threads in a map transformation. We do this to do concurrent DB lookups during one stage for example. I would recommend, however, that you switch to mapPartitions from map as this allows you to create a fixed size thread pool to share across items on a partition as opposed
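A minimal Scala sketch of that mapPartitions approach, assuming an existing RDD[String] named rdd and a hypothetical lookup() function standing in for the DB call:

    import java.util.concurrent.Executors
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    val enriched = rdd.mapPartitions { iter =>
      // one fixed-size pool per partition, shared by all items in it
      val pool = Executors.newFixedThreadPool(8)
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      val futures = iter.map(key => Future(key -> lookup(key))).toList  // lookup() is hypothetical
      val results = futures.map(f => Await.result(f, 30.seconds))
      pool.shutdown()
      results.iterator
    }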

Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Emmanuel Fortin
Thank you for your reply. I will consider HDFS for the checkpoint storage. On Tue, Jul 21, 2015 at 17:51, Dean Wampler deanwamp...@gmail.com wrote: TD's Spark Summit talk offers suggestions (

Convert Simple Kafka Consumer to standalone Spark JavaStream Consumer

2015-07-21 Thread Hafsa Asif
Hi, I have a simple High level Kafka Consumer like : package matchinguu.kafka.consumer; import kafka.consumer.Consumer; import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector; import

Accumulator value 0 in driver

2015-07-21 Thread dlmarion
I am using Accumulators in a JavaRDD<LabeledPoint>.flatMap() method. I have logged the localValue() of the Accumulator object and their values are non-zero. In the driver, after the .flatMap() method returns, calling value() on the Accumulator yields 0. I am running 1.4.0 in yarn-client

SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi I could successfully install SparkR package into my RStudio but I could not execute anything against sc or sqlContext. I did the following: Sys.setenv(SPARK_HOME=/path/to/sparkE1.4.1) .libPaths(c(file.path(Sys.getenv(SPARK_HOME),R,lib),.libPaths())) library(SparkR) Above code installs

Re: Accumulator value 0 in driver

2015-07-21 Thread Ted Yu
Have you called collect() / count() on the RDD following flatMap()? Cheers On Tue, Jul 21, 2015 at 8:47 AM, dlmar...@comcast.net wrote: I am using Accumulators in a JavaRDD<LabeledPoint>.flatMap() method. I have logged the localValue() of the Accumulator object and their values are
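A minimal sketch of the point being made here, assuming an existing RDD named data: the flatMap is lazy, so the accumulator only reflects the executor-side increments after an action runs.

    val acc = sc.accumulator(0, "flatMapCounter")
    val out = data.flatMap { x =>
      acc += 1          // incremented on the executors
      Seq(x)
    }
    out.count()         // the action forces the flatMap to execute
    println(acc.value)  // non-zero only after the action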

Re: Timestamp functions for sqlContext

2015-07-21 Thread Romi Kuntsman
Hi Tal, I'm not sure there is currently a built-in function for it, but you can easily define a UDF (user defined function) by extending org.apache.spark.sql.api.java.UDF1, registering it (sparkContext.udf().register(...)), and then use it inside your query. RK. On Tue, Jul 21, 2015 at 7:04
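A minimal Scala sketch of that suggestion, using sqlContext.udf.register (a Scala alternative to the Java UDF1 route); the table and column names below are made up:

    import java.sql.Timestamp

    // whole days between a timestamp column and "now"
    sqlContext.udf.register("days_since", (ts: Timestamp) =>
      (System.currentTimeMillis() - ts.getTime) / (1000L * 60 * 60 * 24))

    val dateDiffDf = sqlContext.sql("SELECT days_since(eventTime) AS age_days FROM events")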

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Shivaram Venkataraman
FWIW I've run into similar BLAS related problems before and wrote up a document on how to do this for Spark EC2 clusters at https://github.com/amplab/ml-matrix/blob/master/EC2.md -- Note that this works with a vanilla Spark build (you only need to link to netlib-lgpl in your App) but requires the

Timestamp functions for sqlContext

2015-07-21 Thread Tal Rozen
Hi, I'm running a query with sql context where one of the fields is of type java.sql.Timestamp. I'd like to set a function similar to DATEDIFF in mysql, between the date given in each row, and now. So If I was able to use the same syntax as in mysql it would be: val date_diff_df =

query over hive context hangs, please help

2015-07-21 Thread 诺铁
The thread dump is here, seems hang on accessing mysql meta store. I googled and find a bug related to com.mysql.jdbc.util.ReadAheadInputStream, but don't have a workaround. And I am not sure about that. please help me. thanks. thread dump--- MyAppDefaultScheduler_Worker-2 prio=10

How to build Spark with my own version of Hadoop?

2015-07-21 Thread Dogtail Ray
Hi, I have modified some Hadoop code, and want to build Spark with the modified version of Hadoop. Do I need to change the compilation dependency files? How do I do that? Great thanks!

many-to-many join

2015-07-21 Thread John Berryman
Quick example problem that's stumping me: * Users have 1 or more phone numbers and therefore one or more area codes. * There are 100M users. * States have one or more area codes. * I would like to get the states for the users (as indicated by phone area code). I was thinking about something like

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-21 Thread Cody Koeninger
Yeah, I'm referring to that api. If you want to filter messages in addition to catching that exception, have your messageHandler return an Option, so the type R would end up being Option[WhateverYourClassIs], then filter out None before doing the rest of your processing. If you aren't already
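A minimal Scala sketch of that pattern, assuming ssc, kafkaParams and fromOffsets already exist; the emptiness check stands in for whatever per-message filtering is actually needed:

    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // keep a message only when it passes the check, otherwise drop it
    val handler = (mmd: MessageAndMetadata[String, String]) =>
      if (mmd.message.nonEmpty) Some(mmd.message) else None

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Option[String]](
      ssc, kafkaParams, fromOffsets, handler)

    val cleaned = stream.flatMap(opt => opt.toSeq)  // discards the Nones before further processing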

Re: SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, thanks for the reply. I did download it from GitHub, build it, and it is working fine; I can use spark-submit etc. When I use it in RStudio I don't know why it is saying sqlContext not found. When I do the following sqlContext sparkRSQL.init(sc) Error: object sqlContext not found if I do the

Spark Streaming Checkpointing solutions

2015-07-21 Thread Emmanuel
Hi, I'm working on a Spark Streaming application and I would like to know what is the best storage to use for checkpointing. For testing purposes we are using NFS between the workers, the master and the driver program (in client mode), but we have some issues with the CheckpointWriter (1

pyspark equivalent to Extends Serializable

2015-07-21 Thread keegan
I'm trying to define a class that contains as attributes some of Spark's objects and am running into a problem that I think would be solved if I could find Python's equivalent of Scala's Extends Serializable. Here's a simple class that has a Spark RDD as one of its attributes. class Foo: def

Re: No. of Task vs No. of Executors

2015-07-21 Thread shahid ashraf
Thanks All! thanks Ayan! I did the repartition to 20 so it used all cores in the cluster and was done in 3 minutes. seems data was skewed to this partition. On Tue, Jul 14, 2015 at 8:05 PM, ayan guha guha.a...@gmail.com wrote: Hi As you can see, Spark has taken data locality into

Re: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Or
Hi Andrew, Based on your driver logs, it seems the issue is that the shuffle service is actually not running on the NodeManagers, but your application is trying to provide a spark_shuffle secret anyway. One way to verify whether the shuffle service is actually started is to look at the

Re: Accumulator value 0 in driver

2015-07-21 Thread dlmarion
No, I have not. I will try that though, thank you. - Original Message - From: Ted Yu yuzhih...@gmail.com To: dlmarion dlmar...@comcast.net Cc: user user@spark.apache.org Sent: Tuesday, July 21, 2015 12:15:13 PM Subject: Re: Accumulator value 0 in driver Have you called collect()

Re: Accumulator value 0 in driver

2015-07-21 Thread dlmarion
That did it, thanks. - Original Message - From: dlmar...@comcast.net To: Ted Yu yuzhih...@gmail.com Cc: user user@spark.apache.org Sent: Tuesday, July 21, 2015 1:15:37 PM Subject: Re: Accumulator value 0 in driver No, I have not. I will try that though, thank you. -

Re: LinearRegressionWithSGD Outputs NaN

2015-07-21 Thread Burak Yavuz
Hi, Could you please decrease your step size to 0.1, and also try 0.01? You could also try running L-BFGS, which doesn't have step size tuning, to get better results. Best, Burak On Tue, Jul 21, 2015 at 2:59 AM, Naveen nav...@formcept.com wrote: Hi , I am trying to use
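A minimal sketch of the step-size suggestion, assuming the training data is already an RDD[LabeledPoint] called parsedData; a step size that is too large makes SGD diverge into NaN weights:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    val numIterations = 100
    val model01  = LinearRegressionWithSGD.train(parsedData, numIterations, 0.1)   // stepSize = 0.1
    val model001 = LinearRegressionWithSGD.train(parsedData, numIterations, 0.01)  // stepSize = 0.01
    println(model01.weights)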

java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path

2015-07-21 Thread stark_summer
*java:* java version 1.7.0_60 Java(TM) SE Runtime Environment (build 1.7.0_60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode) *scala:* Scala code runner version 2.10.5 -- Copyright 2002-2013, LAMP/EPFL *hadoop cluster:* 51 servers running hadoop-2.3-cdh-5.1.0, and

Re: RowId unique key for Dataframes

2015-07-21 Thread Srikanth
Will work. Thanks! zipWithUniqueId() doesn't guarantee continuous ID either. Srikanth On Tue, Jul 21, 2015 at 9:48 PM, Burak Yavuz brk...@gmail.com wrote: Would monotonicallyIncreasingId

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
On 7/22/15 9:03 AM, Ankit wrote: Thanks a lot Cheng. So it seems even in spark 1.3 and 1.4, parquet ENUMs were treated as Strings in Spark SQL right? So does this mean partitioning for enums already works in previous versions too since they are just treated as strings? It’s a little bit

Re: user threads in executors

2015-07-21 Thread Shushant Arora
I can post multiple items at a time. Data is being read from Kafka and filtered, and after that it is posted. Does foreachPartition load the complete partition in memory or use an iterator over batches under the hood? If the complete batch is not loaded, will using a custom size of 100-200 requests in one batch and post

Re: RowId unique key for Dataframes

2015-07-21 Thread Burak Yavuz
Would monotonicallyIncreasingId https://github.com/apache/spark/blob/d4c7a7a3642a74ad40093c96c4bf45a62a470605/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L637 work for you? Best, Burak On Tue, Jul 21, 2015 at 4:55 PM, Srikanth srikanth...@gmail.com wrote: Hello, I'm
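A minimal sketch of how that function is used, assuming an existing DataFrame called df; the generated IDs are unique and increasing but not consecutive, since they encode the partition ID in the upper bits:

    import org.apache.spark.sql.functions.monotonicallyIncreasingId

    val withId = df.withColumn("rowId", monotonicallyIncreasingId())
    withId.show()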

Re: Timestamp functions for sqlContext

2015-07-21 Thread ??
Hi Rozen, you can get the current time by calling a Java API and then get the row timestamp by SQL: val currentTimeStamp = System.currentTimeMillis() val rowTimestamp = sqlContext.sql("select rowTimestamp from tableName") and then you can write a function like this: def

Re: How to share a Map among RDDS?

2015-07-21 Thread ayan guha
Either you have to do rdd.collect and then broadcast or you can do a join On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote: Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map(aa-1,bb-2,cc-3,...)
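A minimal sketch of the broadcast option, assuming the small map from the question and an RDD of keys called someRdd:

    val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
    val bcMap = sc.broadcast(map1)

    // every executor gets a read-only copy, so the lookup needs no shuffle
    val matched = someRdd.filter(key => bcMap.value.contains(key))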

Re: Partition parquet data by ENUM column

2015-07-21 Thread Ankit
Thanks a lot Cheng. So it seems even in spark 1.3 and 1.4, parquet ENUMs were treated as Strings in Spark SQL right? So does this mean partitioning for enums already works in previous versions too since they are just treated as strings? Also, is there a good way to verify that the partitioning is

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Jerrick Hoang
Hi Lian, Sorry, I'm new to Spark so I did not express myself very clearly. I'm concerned about the situation when, let's say, I have a Parquet table with some partitions and I add a new column A to the parquet schema and write some data with the new schema to a new partition in the table. If I'm not

Re: Convert Simple Kafka Consumer to standalone Spark JavaStream Consumer

2015-07-21 Thread Tathagata Das
From what I understand about your code, it is getting data from different partitions of a topic - get all data from partition 1, then from partition 2, etc. Though you have configured it to read from just one partition (topicCount has count = 1). So I am not sure what your intention is, read all

spark streaming disk hit

2015-07-21 Thread Abhishek R. Singh
Is it fair to say that Storm stream processing is completely in memory, whereas spark streaming would take a disk hit because of how shuffle works? Does spark streaming try to avoid disk usage out of the box? -Abhishek- - To

Re: SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread harirajaram
I'm sorry, I have no idea why it is failing on your side. I have been using this for a while now and it works fine. All I can say is use version 1.4.0, but I don't think it is going to make a big difference. This is the one which I use; a/b are my directories.

spark thrift server supports timeout?

2015-07-21 Thread Judy Nash
Hello everyone, Does the Spark thrift server support timeout? Is there documentation I can reference for questions like these? I know it supports cancel, but not sure about timeout. Thanks, Judy

Re: SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread harirajaram
Yep, I saw that in your previous post and I thought it was a typing mistake you made while posting; I never imagined that it was done in RStudio. Glad it worked. -- View this message in context:

Re: Add column to DF

2015-07-21 Thread Michael Armbrust
Try instead: import org.apache.spark.sql.functions._ val determineDayPartID = udf((evntStDate: String, evntStHour: String) => { val stFormat = new java.text.SimpleDateFormat("yyMMdd") var stDateStr: String = evntStDate.substring(2,8) val stDate: Date = stFormat.parse(stDateStr) val

RE: Add column to DF

2015-07-21 Thread Stefan Panayotov
This is working! Thank you so much :) Stefan Panayotov, PhD Home: 610-355-0919 Cell: 610-517-5586 email: spanayo...@msn.com spanayo...@outlook.com spanayo...@comcast.net From: mich...@databricks.com Date: Tue, 21 Jul 2015 12:08:04 -0700 Subject: Re: Add column to DF To: spanayo...@msn.com

Partition parquet data by ENUM column

2015-07-21 Thread ankits
Hi, I am using a custom build of Spark 1.4 with the parquet dependency upgraded to 1.7. I have Thrift data encoded with parquet that I want to partition by a column of type ENUM. The Spark programming guide says partition discovery is only supported for string and numeric columns, so it seems

Add column to DF

2015-07-21 Thread Stefan Panayotov
Hi, I am trying to add a column to a data frame that I created based on a JSON file like this: val input = hiveCtx.jsonFile("wasb://n...@cmwhdinsightdatastore.blob.core.windows.net/json/*").toDF().persist(StorageLevel.MEMORY_AND_DISK) I have a function that is generating the values for the new

Re: user threads in executors

2015-07-21 Thread Tathagata Das
If you can post multiple items at a time, then use foreachPartition to post the whole partition in a single request. On Tue, Jul 21, 2015 at 9:35 AM, Richard Marscher rmarsc...@localytics.com wrote: You can certainly create threads in a map transformation. We do this to do concurrent DB
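A minimal sketch of the foreachPartition suggestion, assuming a DStream named events and a hypothetical postBatch() helper standing in for the HTTP call:

    events.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        // post 100 events per request instead of one request per event
        iter.grouped(100).foreach(batch => postBatch(batch))
      }
    }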

How to share a Map among RDDS?

2015-07-21 Thread Dan Dong
Hi, All, I am trying to access a Map from RDDs that are on different compute nodes, but without success. The Map is like: val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3, ...) All RDDs will have to check against it to see if the key is in the Map or not, so it seems I have to make the Map itself global; the problem

Spark spark.shuffle.memoryFraction has no effect

2015-07-21 Thread wdbaruni
Hi I am testing Spark on Amazon EMR using Python and the basic wordcount example shipped with Spark. After running the application, I realized that in Stage 0 reduceByKey(add), around 2.5GB shuffle is spilled to memory and 4GB shuffle is spilled to disk. Since in the wordcount example I am not

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew, Thanks for the advice. I didn't see the log in the NodeManager, so apparently, something was wrong with the yarn-site.xml configuration. After digging in more, I realized it was a user error. I'm sharing this with other people so others may know what mistake I have made. When I review

Re: spark streaming disk hit

2015-07-21 Thread Abhishek R. Singh
Thanks TD - appreciate the response ! On Jul 21, 2015, at 1:54 PM, Tathagata Das t...@databricks.com wrote: Most shuffle files are really kept around in the OS's buffer/disk cache, so it is still pretty much in memory. If you are concerned about performance, you have to do a holistic

Class Loading Issue - Spark Assembly and Application Provided

2015-07-21 Thread Ashish Soni
Hi All, I am having a class loading issue, as the Spark assembly uses Google Guice internally and one of the JARs I am using uses sisu-guice-3.1.0-no_aop.jar. How do I load my classes first so that it doesn't result in an error, and tell Spark to load its assembly later on? Ashish

Re: spark streaming disk hit

2015-07-21 Thread Tathagata Das
Most shuffle files are really kept around in the OS's buffer/disk cache, so it is still pretty much in memory. If you are concerned about performance, you have to do a holistic comparison for end-to-end performance. You could take a look at this.

Spark SQL Table Caching

2015-07-21 Thread Brandon White
A few questions about caching a table in Spark SQL. 1) Is there any difference between caching the dataframe and the table? df.cache() vs sqlContext.cacheTable(tableName) 2) Do you need to warm up the cache before seeing the performance benefits? Is the cache LRU? Do you need to run some

Re: Classifier for Big Data Mining

2015-07-21 Thread Olivier Girardot
depends on your data and I guess the time/performance goals you have for both training/prediction, but for a quick answer : yes :) 2015-07-21 11:22 GMT+02:00 Chintan Bhatt chintanbhatt...@charusat.ac.in: Which classifier can be useful for mining massive datasets in spark? Decision Tree can be

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-21 Thread wdbaruni
I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or .cache() and can be defined by setting spark.storage.memoryFraction (default 0.6) *Shuffle and aggregation buffers:*

NullPointerException inside RDD when calling sc.textFile

2015-07-21 Thread MorEru
I have a number of CSV files and need to combine them into a RDD by part of their filenames. For example, for the below files $ ls 20140101_1.csv 20140101_3.csv 20140201_2.csv 20140301_1.csv 20140301_3.csv 20140101_2.csv 20140201_1.csv 20140201_3.csv I need to combine files with names

Re: Classifier for Big Data Mining

2015-07-21 Thread Chintan Bhatt
How do I load a dataset in Apache Spark? Where can I find sources of massive datasets? On Wed, Jul 22, 2015 at 4:50 AM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: I'd use Random Forest. It will give you better generalizability. There are also a number of things you can do with RF that allows to

Re: How to restart Twitter spark stream

2015-07-21 Thread Zoran Jeremic
Hi Akhil and Jorn, I tried as you suggested to create some simple scenario, but I have an error on rdd.join(newRDD): value join is not a member of org.apache.spark.rdd.RDD[twitter4j.Status]. The code looks like: val stream = TwitterUtils.createStream(ssc, auth) val filteredStream=

Question on Spark SQL for a directory

2015-07-21 Thread Ron Gonzalez
Hi, Question on using spark sql. Can someone give an example for creating table from a directory containing parquet files in HDFS instead of an actual parquet file? Thanks, Ron On 07/21/2015 01:59 PM, Brandon White wrote: A few questions about caching a table in Spark SQL. 1) Is there

Re: Classifier for Big Data Mining

2015-07-21 Thread Ron Gonzalez
I'd use Random Forest. It will give you better generalizability. There are also a number of things you can do with RF that allow you to train on samples of the massive data set and then just average over the resulting models... Thanks, Ron On 07/21/2015 02:17 PM, Olivier Girardot wrote: depends
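A minimal MLlib sketch of training a Random Forest classifier; the LibSVM path and all parameter values below are illustrative only:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",   // lets MLlib pick features per node
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32,
      seed = 42)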

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the master branch. https://github.com/apache/spark/pull/7048 ENUM types are actually not in the Parquet format spec, that's why we didn't have it in the first place. Basically, ENUMs are always treated as UTF8 strings in

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Cheng Lian
Hey Jerrick, What do you mean by schema evolution with Hive metastore tables? Hive doesn't take schema evolution into account. Could you please give a concrete use case? Are you trying to write Parquet data with extra columns into an existing metastore Parquet table? Cheng On 7/21/15 1:04

Re: Question on Spark SQL for a directory

2015-07-21 Thread Michael Armbrust
https://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically On Tue, Jul 21, 2015 at 4:06 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: Hi, Question on using spark sql. Can someone give an example for creating table from a directory containing

Re: dataframes sql order by not total ordering

2015-07-21 Thread Carol McDonald
Thanks, that works a lot better ;) scala> val results = sqlContext.sql("select movies.title, movierates.maxr, movierates.minr, movierates.cntu from (SELECT ratings.product, max(ratings.rating) as maxr, min(ratings.rating) as minr, count(distinct user) as cntu FROM ratings group by ratings.product )

Re: use S3-Compatible Storage with spark

2015-07-21 Thread Akhil Das
You can add the jar in the classpath, and you can set the property like: sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.sigmoid.com") Thanks Best Regards On Mon, Jul 20, 2015 at 9:41 PM, Schmirr Wurst schmirrwu...@gmail.com wrote: Thanks, that is what I was looking for... Any Idea
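A slightly fuller sketch along the same lines; the property names follow the Hadoop s3a connector, and the endpoint and bucket values are made up:

    sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.example.com")

    val lines = sc.textFile("s3a://my-bucket/path/to/data/*")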

Re: spark streaming 1.3 issues

2015-07-21 Thread Akhil Das
I'd suggest upgrading to 1.4 as it has better metrics and UI. Thanks Best Regards On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora shushantaror...@gmail.com wrote: Is coalesce not applicable to kafkaStream? How to do coalesce on kafkadirectstream, it's not there in the api? Shall calling

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Akhil Das
Do you have HADOOP_HOME, HADOOP_CONF_DIR and hadoop's winutils.exe in the environment? Thanks Best Regards On Mon, Jul 20, 2015 at 5:45 PM, nitinkalra2000 nitinkalra2...@gmail.com wrote: Hi All, I am working on Spark 1.4 on windows environment. I have to set eventLog directory so that I can

Re: updateStateByKey schedule time

2015-07-21 Thread Anand Nalya
I also ran into a similar use case. Is this possible? On 15 July 2015 at 18:12, Michel Hubert mich...@vsnsystemen.nl wrote: Hi, I want to implement a time-out mechanism in the updateStateByKey(…) routine. But is there a way to retrieve the time of the start of the batch

RE: standalone to connect mysql

2015-07-21 Thread Jack Yang
That works! Thanks. Can I ask you one further question? How does Spark SQL support insertion? That is to say, if I did: sqlContext.sql(insert into newStu values (“10”,”a”,1) the error is: failure: ``table'' expected but identifier newStu found insert into newStu values ('10', aa, 1) but if I did:

user threads in executors

2015-07-21 Thread Shushant Arora
Hi, Can I create user threads in executors? I have a streaming app where, after processing, I have a requirement to push events to an external system. Each post request costs ~90-100 ms. To make posts parallel, I cannot use the same thread because that is limited by the number of cores available in the system. Can I

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Nitin Kalra
Hi Akhil, I don't have HADOOP_HOME or HADOOP_CONF_DIR, or even winutils.exe. What's the configuration required for this? Where can I get winutils.exe? Thanks and Regards, Nitin Kalra On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Do you have HADOOP_HOME,

Re: Is there more information about spark shuffle-service

2015-07-21 Thread Ted Yu
To my knowledge, there is no HA for External Shuffle Service. Cheers On Jul 21, 2015, at 2:16 AM, JoneZhang joyoungzh...@gmail.com wrote: There is a saying If the service is enabled, Spark executors will fetch shuffle files from the service instead of from each other. in the wiki

Re: use S3-Compatible Storage with spark

2015-07-21 Thread Schmirr Wurst
Which version do you have ? - I tried with spark 1.4.1 for hdp 2.6, but here I had an issue that the aws-module is not there somehow: java.io.IOException: No FileSystem for scheme: s3n the same for s3a : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class

LinearRegressionWithSGD Outputs NaN

2015-07-21 Thread Naveen
Hi, I am trying to use LinearRegressionWithSGD on the Million Song Data Set and my model returns NaNs as weights and 0.0 as the intercept. What might be the cause of the error? I am using Spark 1.4.0 in standalone mode. Below is my model: val numIterations = 100 val stepSize = 1.0 val

Would driver shutdown cause app dead?

2015-07-21 Thread ZhuGe
Hi all: I am a bit confused about the role of the driver. In our production environment, we have a Spark Streaming app running in standalone mode. What we are concerned about is: if the driver shuts down accidentally (host shutdown or whatever), would the app keep running normally? Any explanation would be appreciated!!

Re: standalone to connect mysql

2015-07-21 Thread Terry Hole
Jack, You can refer to the Hive SQL syntax if you use HiveContext: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML Thanks! -Terry That works! Thanks. Can I ask you one further question? How did spark sql support insertion? That is say, if I did: sqlContext.sql(insert
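A hedged sketch of what that can look like with a HiveContext; the plain SQLContext parser rejects the INSERT ... VALUES form shown in this thread, while Hive's INSERT INTO TABLE ... SELECT form works (the table names below are made up):

    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveCtx.sql("CREATE TABLE IF NOT EXISTS newStu (id STRING, name STRING, age INT)")
    // insert a literal row by selecting constants from any existing one-row table
    hiveCtx.sql("INSERT INTO TABLE newStu SELECT '10', 'a', 1 FROM someTable LIMIT 1")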

Running driver app as a daemon

2015-07-21 Thread algermissen1971
Hi, I am trying to start a driver app as a daemon using Linux' start-stop-daemon script (I need console detaching, unbuffered STDOUT/STDERR to logfile and start/stop using a PID file). I am doing this like this (which works great for the other apps we have) /sbin/start-stop-daemon -c $USER

writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-21 Thread Krzysztof Zarzycki
Hi everyone, I have a pretty challenging problem with reading/writing multiple parquet files with streaming, but let me introduce my data flow: I have a lot of JSON events streaming to my platform. All of them have the same structure, but fields are mostly optional. Some of the fields are arrays

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Sean Owen
Great, and that file exists on HDFS and is world readable? just double-checking. What classpath is this -- your driver or executor? this is the driver, no? I assume so just because it looks like it references the assembly you built locally and from which you're launching the driver. I think

Re: Using Dataframe write with newHdoopApi

2015-07-21 Thread ayan guha
Guys, any help would be great!! I am trying to use DF and save it to Elasticsearch using newHadoopApi (because I am using Python). Can anyone tell me if this is even possible? I have managed to use df.rdd to complete the ES integration but I preferred df.write. Is it possible or upcoming?

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-21 Thread Nicolas Phung
Hi Cody, Thanks for your answer. I'm with Spark 1.3.0. I don't quite understand how to use the messageHandler parameter/function in the createDirectStream method. You are referring to this, aren't you? def createDirectStream[K: ClassTag, V: ClassTag, KD <: Decoder[K]: ClassTag, VD <:

Re: standalone to connect mysql

2015-07-21 Thread Jack Yang
I may have found the answer in the SqlParser.scala file. It looks like the syntax Spark uses for insert is different from what we normally use for MySQL. I hope someone can confirm this. Also I would appreciate it if there is a SQL reference list available. Sent from my iPhone On 21 Jul 2015, at

Re: standalone to connect mysql

2015-07-21 Thread Jack Yang
No, I did not use HiveContext at this stage. I am talking about the embedded SQL syntax of pure Spark SQL. Thanks, mate. On 21 Jul 2015, at 6:13 pm, Terry Hole hujie.ea...@gmail.com wrote: Jack, You can refer to the Hive SQL syntax if you use HiveContext:

Re: k-means iteration not terminate

2015-07-21 Thread Pa Rö
Thanks for this information, but I use Cloudera Live 5.4.4, and that has only Spark 1.3; a newer version is not available. I don't understand this problem: first it computes some iterations and then it just stops and does nothing. I think the problem is not in the program code. Maybe you know a

Re: What is the correct syntax of using Spark streamingContext.fileStream()?

2015-07-21 Thread Akhil Das
Here are two ways of doing that: Without the filter function: JavaPairDStream<String, String> foo = ssc.<String, String, SequenceFileInputFormat>fileStream("/tmp/foo"); With the filter function: JavaPairInputDStream<LongWritable, Text> foo = ssc.fileStream("/tmp/foo", LongWritable.class,

DataFrame writer removes fields which is null for all rows

2015-07-21 Thread Hao Ren
Consider the following code: val df = Seq((1, 3), (2, 3)).toDF("key", "value").registerTempTable("tbl") sqlContext.sql("select key, null as value from tbl") .write.format("json").mode(SaveMode.Overwrite).save("test") sqlContext.read.format("json").load("test").printSchema() It shows: root |-- key: long

Re: use S3-Compatible Storage with spark

2015-07-21 Thread Akhil Das
Did you try with s3a? It seems it's more like an issue with Hadoop. Thanks Best Regards On Tue, Jul 21, 2015 at 2:31 PM, Schmirr Wurst schmirrwu...@gmail.com wrote: It seems to work for the credentials, but the endpoint is ignored.. : I've changed it to

Classifier for Big Data Mining

2015-07-21 Thread Chintan Bhatt
Which classifier can be useful for mining massive datasets in spark? Decision Tree can be good choice as per scalability? -- CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/ Assistant Professor, U P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of

Re: Broadcast variables in R

2015-07-21 Thread Serge Franchois
I might add to this that I've done the same exercise on Linux (CentOS 6) and there, broadcast variables ARE working. Is this functionality perhaps not exposed on Mac OS X? Or has it to do with the fact there are no native Hadoop libs for Mac? -- View this message in context:

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Akhil Das
Here are some resources which will help you with that. - http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path - https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Tue, Jul 21, 2015 at 1:57 PM, Nitin Kalra

Spark SQL/DDF's for production

2015-07-21 Thread bipin
Hi, I want to ask about an issue I have faced while using Spark. I load dataframes from parquet files. Some dataframes' parquet files have lots of partitions and 10 million rows. Running a where id = x query on a dataframe scans all partitions. When saving to rdd object/parquet there is a partition column. The

Re: use S3-Compatible Storage with spark

2015-07-21 Thread Schmirr Wurst
It seems to work for the credentials, but the endpoint is ignored: I've changed it to sc.hadoopConfiguration.set("fs.s3n.endpoint", "test.com") And I continue to get my data from Amazon, how could that be? (I also use s3n in my test url) 2015-07-21 9:30 GMT+02:00 Akhil Das

Is there more information about spark shuffle-service

2015-07-21 Thread JoneZhang
There is a saying, "If the service is enabled, Spark executors will fetch shuffle files from the service instead of from each other", in the wiki https://spark.apache.org/docs/1.3.0/job-scheduling.html#graceful-decommission-of-executors

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Ted Yu
Please see the tail of: https://issues.apache.org/jira/browse/SPARK-2356 On Jul 21, 2015, at 1:27 AM, Nitin Kalra nitinkalra2...@gmail.com wrote: Hi Akhil, I don't have HADOOP_HOME or HADOOP_CONF_DIR and even winutils.exe ? What's the configuration required for this ? From where can I

Re: standalone to connect mysql

2015-07-21 Thread Terry Hole
Maybe you can try: spark-submit --class sparkwithscala.SqlApp --jars /home/lib/mysql-connector-java-5.1.34.jar --master spark://hadoop1:7077 /home/myjar.jar Thanks! -Terry Hi there, I would like to use spark to access the data in mysql. So firstly I tried to run the program using:

Spark Application stuck retrying task failed on Java heap space?

2015-07-21 Thread Romi Kuntsman
Hello, *TL;DR: task crashes with OOM, but application gets stuck in infinite loop retrying the task over and over again instead of failing fast.* Using Spark 1.4.0, standalone, with DataFrames on Java 7. I have an application that does some aggregations. I played around with shuffling settings,

1.4.0 classpath issue with spark-submit

2015-07-21 Thread Michal Haris
I have a spark program that uses dataframes to query hive and I run it both as a spark-shell for exploration and I have a runner class that executes some tasks with spark-submit. I used to run against 1.4.0-SNAPSHOT. Since then 1.4.0 and 1.4.1 were released so I tried to switch to the official

log4j.xml bundled in jar vs log4j.properties in spark/conf

2015-07-21 Thread igor.berman
Hi, I have log4j.xml in my jar. From 1.4.1 it seems that log4j.properties in spark/conf is defined first in the classpath, so spark/conf/log4j.properties wins; before that (in v1.3.0) the log4j.xml bundled in the jar defined the configuration. If I manually add my jar to be strictly first in the classpath (by

Re: k-means iteration not terminate

2015-07-21 Thread Akhil Das
It could be a GC pause or something, you need to check in the stages tab and see what is taking time, If you upgrade to Spark 1.4, it has better UI and DAG visualization which helps you debug better. Thanks Best Regards On Mon, Jul 20, 2015 at 8:21 PM, Pa Rö paul.roewer1...@googlemail.com wrote:

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Arun Ahuja
Yes, I imagine it's the driver's classpath - I'm pulling those screenshots straight from the Spark UI environment page. Is there somewhere else to grab the executor classpath? Also, the warning is only printed once, so it's also not clear whether the warning is from the driver or executor,