Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Piotr Smoliński
Hi Peter,

The blank _SUCCESS file indicates a properly finished output operation.

What is the topology of your application?
I presume you write to a local filesystem and have more than one worker
machine. In that case Spark writes the result files for each partition on
the worker that holds it, and completes the operation by writing _SUCCESS
on the driver node.
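
If what you actually want is the data itself in a single CSV file, a minimal
sketch (the path is illustrative) is to write to a filesystem every node can
reach, collapsing to one partition first (only reasonable for small results):

df.coalesce(1).write.csv("hdfs:///path/to/foo")  // one part file plus _SUCCESS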

Cheers,
Piotr


On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi 
wrote:

> Both
>
> df.write.csv("/path/to/foo")
>
> and
>
> df.write.format("com.databricks.spark.csv").save("/path/to/foo")
>
> results in a *blank* file called "_SUCCESS" under /path/to/foo.
>
> My df has stuff in it.. tried this with both my real df, and a quick df
> constructed from literals.
>
> Why isn't it writing anything?
>
> Thanks,
>
> Pete
>


MLlib Documentation Update Needed

2016-09-25 Thread Tobi Bosede
The loss function here
for logistic regression is confusing. It seems to imply that Spark uses
only -1 and 1 class labels. However, it uses 0 and 1, as the very
inconspicuous note quoted below (under Classification) says. We need to make
this point more visible to avoid confusion.

Better yet, we should replace the listed loss function with the one for 0/1
labels, no matter how mathematically inconvenient, since that is what is
actually implemented in Spark.

More problematic, the loss function (even in this "convenient" form) is
actually incorrect: it is missing either a summation (sigma) over the log
terms or a product (pi) inside the log, since the loss for logistic
regression is the (negative) log likelihood. So there are multiple problems
with the documentation. Please advise on steps to fix this for the
documentation of all versions, or whether such steps are already in place.
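
Concretely, in my notation (not the guide's): with +1/-1 labels the guide
lists the per-example loss

  L(w; x, y) = \log\left(1 + \exp(-y\, w^{\top} x)\right), \qquad y \in \{-1, +1\},

whereas with 0/1 labels the full objective is the negative log likelihood,
i.e. a sum over all n examples:

  -\ell(w) = -\sum_{i=1}^{n} \left[ y_i \log \sigma(w^{\top} x_i)
             + (1 - y_i) \log\left(1 - \sigma(w^{\top} x_i)\right) \right],
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.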

"Note that, in the mathematical formulation in this guide, a binary label
y is denoted as either +1 (positive) or −1 (negative), which is convenient
for the formulation. *However*, the negative label is represented by 0 in
spark.mllib instead of −1, to be consistent with multiclass labeling."


Re: spark stream based deduplication

2016-09-25 Thread backtrack5
Thank you @markcitizen. What I want to achieve is, for example:

My historic RDD has
(Hash1, recordid1)
(Hash2, recordid2)

And in the new stream I have the following:
(Hash3, recordid3)
(Hash1, recordid5)

In this scenario:
1) For recordid5, I should learn that recordid5 is a duplicate of recordid1.
2) The new pair (Hash3, recordid3) should be added to the historic RDD.

I have one other question to ask:
If the program crashes at any point, is it possible to recover that historic
RDD? Can I use stateful streaming?
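
Something like the rough sketch below is what I have in mind (names and the
checkpoint path are illustrative, and I am not sure this is the right API):

import org.apache.spark.streaming.{State, StateSpec}

// hashedStream: DStream[(String, String)] of (hash, recordId)
// historicRdd:  RDD[(String, String)] of (hash, recordId), used as initial state
val spec = StateSpec.function {
  (hash: String, rec: Option[String], state: State[String]) =>
    state.getOption() match {
      case Some(original) =>
        (rec.get, Some(original))             // duplicate of `original`
      case None =>
        state.update(rec.get)                 // remember the new record
        (rec.get, None: Option[String])
    }
}.initialState(historicRdd)

val deduped = hashedStream.mapWithState(spec)
ssc.checkpoint("hdfs:///checkpoints/dedup")   // also lets the state survive a restart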



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-stream-based-deduplication-tp27770p27792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: udf forces usage of Row for complex types?

2016-09-25 Thread Bedrytski Aliaksandr
Hi Koert,

the case classes you are talking about should be serializable to be
efficient (with Kryo or plain Java serialization).

A DataFrame is not simply a collection of Rows (which are serializable by
default); it also contains a schema with a different type for each column.
This way any columnar data may be represented without creating custom
case classes each time.

If you want to manipulate a collection of case classes, why not use good
old RDDs? (Or Datasets if you are using Spark 2.0.)
If you want to run SQL against that collection, you will need to tell
your application how to read it as a table (by transforming it into a
DataFrame).
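
For example, a minimal sketch (Spark 2.0, assuming a SparkSession named spark):

import spark.implicits._

case class Person(name: String, age: Int)

// a Dataset of case classes, exposed to SQL through a temp view
val people = Seq(Person("john", 33), Person("mike", 30)).toDS()
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()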

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote:
> after having gotten used to have case classes represent complex
> structures in Datasets, i am surprised to find out that when i work in
> DataFrames with udfs no such magic exists, and i have to fall back to
> manipulating Row objects, which is error prone and somewhat ugly.
> for example:
> case class Person(name: String, age: Int)
>
> val df = Seq((Person("john", 33), 5), (Person("mike", 30),
> 6)).toDF("person", "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age =
> p.age + 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> leads to:
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
> be cast to Person


Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix

2016-09-25 Thread Mario Ds Briggs

Hi Daniel,

Can you give it a try in IBM's Analytics for Spark? The fix has been in
for a week now.


thanks
Mario



From:   Daniel Lopes 
To: Mario Ds Briggs/India/IBM@IBMIN
Cc: Adam Roberts , user
, Steve Loughran
, Sachin Aggarwal4/India/IBM@IBMIN
Date:   14/09/2016 01:19 am
Subject:Re: Fw: Spark + Parquet + IBM Block Storage at Bluemix



Hi Mario,

Thanks for your help, so I will keep using CSVs.

Best,

Daniel Lopes
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br

On Mon, Sep 12, 2016 at 3:39 PM, Mario Ds Briggs 
wrote:
  Daniel,

  I believe it is related to
  https://issues.apache.org/jira/browse/SPARK-13979 and happens only when a
  task fails in an executor (probably, for some other reason, you hit the
  latter in Parquet and not CSV).

  The PR in there should be available shortly in IBM's Analytics for
  Spark.


  thanks
  Mario


  From: Adam Roberts/UK/IBM
  To: Mario Ds Briggs/India/IBM@IBMIN
  Date: 12/09/2016 09:37 pm
  Subject: Fw: Spark + Parquet + IBM Block Storage at Bluemix


  Mario, in case you've not seen this...

  Adam Roberts
  IBM Spark Team Lead
  Runtime Technologies - Hursley

  - Forwarded by Adam Roberts/UK/IBM on 12/09/2016 17:06 -

  From: Daniel Lopes 
  To: Steve Loughran 
  Cc: user 
  Date: 12/09/2016 13:05
  Subject: Re: Spark + Parquet + IBM Block Storage at Bluemix




  Thanks Steve,

  But this error occurs only with Parquet files; CSVs work.

  Best,

  Daniel Lopes
  Chief Data and Analytics Officer | OneMatch
  c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

  www.onematch.com.br

  On Sun, Sep 11, 2016 at 3:28 PM, Steve Loughran 
  wrote:
On 9 Sep 2016, at 17:56, Daniel Lopes <
dan...@onematch.com.br> wrote:

Hi, can someone help?

I'm trying to use Parquet in IBM Block Storage with Spark,
but when I try to load I get this error:

using this config

credentials = {
  "name": "keystone",
  "auth_url": "https://identity.open.softlayer.com;,
  "project":
"object_storage_23f274c1_d11XXXe634",
  "projectId": "XXd9c4aa39b7c7eb",
  "region": "dallas",
  "userId": "X64087180b40X2b909",
  "username": "admin_9dd810f8901d48778XX",
  "password": "chX6_",
  "domainId": "c1ddad17cfcX41",
  "domainName": "10XX",
  "role": "admin"
}

def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given credentials,
    so it is possible to access data using SparkContext"""

    prefix = "fs.swift.service."

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both

df.write.csv("/path/to/foo")

and

df.write.format("com.databricks.spark.csv").save("/path/to/foo")

result in a *blank* file called "_SUCCESS" under /path/to/foo.

My df has data in it; I tried this with both my real df and a quick df
constructed from literals.

Why isn't it writing anything?

Thanks,

Pete


Re: Spark MLlib ALS algorithm

2016-09-25 Thread Roshani Nagmote
Hello,

I ran the ALS algorithm on 30 c4.8xlarge machines (60 GB RAM each) with the
1.4 GB Netflix dataset (Users: 480189, Items: 17770, Ratings: 99M).

The command I run:

/usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn  --jars
/usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar netflixals_2.11-1.0.jar
--rank 200 --numIterations 30 --lambda 5e-3 --kryo s3://netflix_train
s3://netflix_test
I get the following error:

Job aborted due to stage failure: Task 625 in stage 28.0 failed 4 times,
most recent failure: Lost task 625.3 in stage 28.0 (TID 9362, ip.ec2):
java.io.FileNotFoundException:/mnt/yarn/usercache/hadoop/appcache/application_1474477668615_0164/blockmgr-3d1ef0f7-9c9a-4495-8249-bea38e7dd347/06/shuffle_9_625_0.data.e9330598-330c-4622-afd9-27030c470f8a
(No space left on device)

I did set the checkpoint dir in S3 and used a checkpoint interval of 5.
The dataset is very small, so I don't know why it won't run on a 30-node
Spark EMR cluster and runs out of space.
Can anyone please help me with this?
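
For reference, the checkpointing setup looks roughly like this (the bucket
name is a placeholder; rank, iterations and lambda match the command above):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("s3://my-bucket/als-checkpoints/")  // placeholder path

val model = new ALS()
  .setRank(200)
  .setIterations(30)
  .setLambda(5e-3)
  .setCheckpointInterval(5)
  .run(ratings)  // ratings: RDD[Rating]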

Thanks,
Roshani


On Fri, Sep 23, 2016 at 11:50 PM, Nick Pentreath 
wrote:

> The scale factor was only to scale up the number of ratings in the dataset
> for performance testing purposes, to illustrate the scalability of Spark
> ALS.
>
> It is not something you would normally do on your training dataset.
>
> On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote 
> wrote:
>
>> Hello,
>>
>> I was working on Spark MLlib ALS Matrix factorization algorithm and came
>> across the following blog post:
>>
>> https://databricks.com/blog/2014/07/23/scalable-
>> collaborative-filtering-with-spark-mllib.html
>>
>> Can anyone help me understanding what "s" scaling factor does and does it
>> really give better performance? What's the significance of this?
>> If we convert input data to scaledData with the help of "s", will it
>> speedup the algorithm?
>>
>> Scaled data usage:
>> *(For each user, we create pseudo-users that have the same ratings. That
>> is, for every rating as (userId, productId, rating), we generate (userId+i,
>> productId, rating) where 0 <= i < s and s is the scaling factor)*
>>
>> Also, this blogpost is for spark 1.1 and I am currently using 2.0
>>
>> Any help will be greatly appreciated.
>>
>> Thanks,
>> Roshani
>>
>


Extract timestamp from Kafka message

2016-09-25 Thread Kevin Tran
Hi Everyone,
Does anyone know how we could extract the timestamp from a Kafka message in
Spark Streaming?

JavaPairInputDStream<String, String> messagesDStream =
KafkaUtils.createDirectStream(
   ssc,
   String.class,
   String.class,
   StringDecoder.class,
   StringDecoder.class,
   kafkaParams,
   topics
   );
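
From what I can tell, the 0.8 direct stream above does not expose a
per-message timestamp. A sketch of what I am hoping for, in Scala, assuming
the spark-streaming-kafka-0-10 integration (Kafka 0.10+ records carry a
timestamp; topics and kafkaParams are reused here illustratively):

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

// each element is a ConsumerRecord; timestamp() is in milliseconds
val withTimestamps = stream.map(r => (r.timestamp(), r.key(), r.value()))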


Thanks,
Kevin.


Re: ArrayType support in Spark SQL

2016-09-25 Thread Koert Kuipers
not pretty but this works:

import org.apache.spark.sql.functions.udf
df.withColumn("array", sqlf.udf({ () => Seq(1, 2, 3) }).apply())


On Sun, Sep 25, 2016 at 6:13 PM, Jason White 
wrote:

> It seems that `functions.lit` doesn't support ArrayTypes. To reproduce:
>
> org.apache.spark.sql.functions.lit(2 :: 1 :: Nil)
>
> java.lang.RuntimeException: Unsupported literal type class
> scala.collection.immutable.$colon$colon List(2, 1)
>   at
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(
> literals.scala:59)
>   at org.apache.spark.sql.functions$.lit(functions.scala:101)
>   ... 48 elided
>
> This is about the first thing I tried to do with ArrayTypes in Spark SQL.
> Is
> this usage supported, or on the roadmap?
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/ArrayType-support-
> in-Spark-SQL-tp19063.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-25 Thread Jianshi
Dear all:

I am trying out the newly released Structured Streaming feature in Spark
2.0. I use Structured Streaming to perform windowing by event time, and I
can print the result to the console. I would like to write the result to a
Cassandra database through the foreach sink option. I am trying to use the
spark-cassandra-connector to save the result. The connector saves an RDD to
Cassandra by calling rdd.saveToCassandra(), and this works fine if I execute
the commands in spark-shell. For example:
import com.datastax.spark.connector._
val col = sc.parallelize(Seq(("of", 1200), ("the", 863)))
col.saveToCassandra(keyspace, table)

However, when I use sc.parallelize inside the foreach sink, it raises an
error. The input file contains JSON messages, one per row, like the following:
{"id": text, "time": timestamp, "hr": int}

Here is my code:

object StructStream {
  def main(args: Array[String]) {
val conf = new SparkConf(true).set("spark.cassandra.connection.host",
"172.31.0.174")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val spark =
SparkSession.builder.appName("StructuredAverage").getOrCreate()
import spark.implicits._
 
val userSchema = new StructType().add("id", "string").add("hr",
"integer").add("time","timestamp")
val jsonDF =
spark.readStream.schema(userSchema).json("hdfs://ec2-52-45-70-95.compute-1.amazonaws.com:9000/test3/")
val line_count = jsonDF.groupBy(window($"time","2 minutes","1 minutes"),
$"id").count().orderBy("window")
 
import org.apache.spark.sql.ForeachWriter
 
val writer = new ForeachWriter[org.apache.spark.sql.Row] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: org.apache.spark.sql.Row) = {
val toRemove = "[]".toSet
val v_str = value.toString().filterNot(toRemove).split(",")
val v_df =
sc.parallelize(Seq(Stick(v_str(2),v_str(3).toInt,v_str(1),v_str(0
v_df.saveToCassandra("playground","sstest")
println(v_str(0),v_str(1),v_str(2),v_str(3))}
  override def close(errorOrNull: Throwable) = ()
}
 
val query =
line_count.writeStream.outputMode("complete").foreach(writer).start()
 
query.awaitTermination()
 
  }
 
}
 
case class Stick(aid: String, bct:Int, cend: String, dst: String)

The error message looks like this:

Error:
org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:882)
at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:881)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:881)
at
org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2117)
at
org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2117)
at
org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2117)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2116)
at
org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:69)
at
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:375)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:194)
at
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:184)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:120)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext


If I remove the sc.parallelize() call, then the code works fine, and the
result is printed to the console:

(2016-10-19 01:16:00.0,2016-10-19 01:18:00.0,user_test,11)
(2016-10-19 01:17:00.0,2016-10-19 01:19:00.0,user_test,11)
(2016-10-19 01:18:00.0,2016-10-19 01:20:00.0,user_test,11)
(2016-10-19 
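
What I am considering instead, to avoid touching the SparkContext inside the
writer, is to talk to Cassandra directly from process() via the connector's
CassandraConnector. A rough sketch (the column names and types of
playground.sstest are guessed from the Stick(...) line above):

import com.datastax.spark.connector.cql.CassandraConnector

// CassandraConnector is serializable, so it is safe to capture in the writer
val connector = CassandraConnector(spark.sparkContext.getConf)

val writer = new ForeachWriter[org.apache.spark.sql.Row] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: org.apache.spark.sql.Row) = {
    connector.withSessionDo { session =>
      session.execute(
        "INSERT INTO playground.sstest (aid, bct, cend, dst) VALUES (?, ?, ?, ?)",
        value.getAs[String]("id"),
        Int.box(value.getAs[Long]("count").toInt),
        value.getStruct(0).getTimestamp(1).toString,  // window end
        value.getStruct(0).getTimestamp(0).toString)  // window start
    }
  }
  override def close(errorOrNull: Throwable) = ()
}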

udf forces usage of Row for complex types?

2016-09-25 Thread Koert Kuipers
after having gotten used to having case classes represent complex structures
in Datasets, I am surprised to find out that when I work in DataFrames with
udfs no such magic exists, and I have to fall back to manipulating Row
objects, which is error prone and somewhat ugly.

for example:
case class Person(name: String, age: Int)

val df = Seq((Person("john", 33), 5), (Person("mike", 30),
6)).toDF("person", "id")
val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age +
1) }).apply(col("person")))
df1.printSchema
df1.show

leads to:
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to Person
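
The closest I have come so far is to switch to a typed Dataset for this step
(a sketch, assuming a SparkSession named spark):

import spark.implicits._

case class Record(person: Person, id: Int)

val ds  = df.as[Record]
val ds1 = ds.map(r => r.copy(person = r.person.copy(age = r.person.age + 1)))
ds1.printSchema
ds1.show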


Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi,

How did you install Spark 1.6? Removing it is usually as simple as rm -rf
$SPARK_1.6_HOME, but it really depends on how you installed it in the
first place.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Sep 25, 2016 at 4:32 PM, vr spark  wrote:
> yes, i have both spark 1.6 and spark 2.0.
> I unset the spark home environment variable and pointed spark submit to 2.0.
> Its working now.
>
> How do i uninstall/remove spark 1.6 from mac?
>
> Thanks
>
>
> On Sun, Sep 25, 2016 at 4:28 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Can you execute run-example SparkPi with your Spark installation?
>>
>> Also, see the logs:
>>
>> 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>>
>> 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI'
>> on port 4041.
>>
>> You've got two Spark runtimes up that may or may not contribute to the
>> issue.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Sep 25, 2016 at 8:36 AM, vr spark  wrote:
>> > Hi,
>> > I have this simple scala app which works fine when i run it as scala
>> > application from the scala IDE for eclipse.
>> > But when i export is as jar and run it from spark-submit i am getting
>> > below
>> > error. Please suggest
>> >
>> > bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar
>> >
>> > 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port
>> > 4040.
>> > Attempting port 4041.
>> >
>> > 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI' on
>> > port
>> > 4041.
>> >
>> > 16/09/24 23:15:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
>> > http://192.168.1.3:4041
>> >
>> > 16/09/24 23:15:15 INFO SparkContext: Added JAR
>> > file:/Users/vr/Downloads/spark-2.0.0/test.jar at
>> > spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
>> >
>> > 16/09/24 23:15:15 INFO Executor: Starting executor ID driver on host
>> > localhost
>> >
>> > 16/09/24 23:15:15 INFO Utils: Successfully started service
>> > 'org.apache.spark.network.netty.NettyBlockTransferService' on port
>> > 59264.
>> >
>> > 16/09/24 23:15:15 INFO NettyBlockTransferService: Server created on
>> > 192.168.1.3:59264
>> >
>> > 16/09/24 23:15:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0
>> > (TID
>> > 0, localhost, partition 0, PROCESS_LOCAL, 5354 bytes)
>> >
>> > 16/09/24 23:15:16 INFO TaskSetManager: Starting task 1.0 in stage 0.0
>> > (TID
>> > 1, localhost, partition 1, PROCESS_LOCAL, 5354 bytes)
>> >
>> > 16/09/24 23:15:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
>> >
>> > 16/09/24 23:15:16 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
>> >
>> > 16/09/24 23:15:16 INFO Executor: Fetching
>> > spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
>> >
>> > 16/09/24 23:16:31 INFO Executor: Fetching
>> > spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
>> >
>> > 16/09/24 23:16:31 ERROR Executor: Exception in task 1.0 in stage 0.0
>> > (TID 1)
>> >
>> > java.io.IOException: Failed to connect to /192.168.1.3:59263
>> >
>> > at
>> >
>> > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>> >
>> > at
>> >
>> > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>> >
>> > at
>> >
>> > org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:358)
>> >
>> > at
>> > org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
>> >
>> > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:633)
>> >
>> > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:459)
>> >
>> > at
>> >
>> > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:488)
>> >
>> > at
>> >
>> > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:480)
>> >
>> > at
>> >
>> > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>> >
>> > at
>> >
>> > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>> >
>> > at
>> >
>> > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>> >
>> > at
>> >
>> > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>> >
>> > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>> >
>> > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>> >
>> > at
>> >
>> > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>> >
>> > at
>> >

Re: spark-submit failing but job running from scala ide

2016-09-25 Thread vr spark
yes, I have both Spark 1.6 and Spark 2.0.
I unset the SPARK_HOME environment variable and pointed spark-submit to 2.0.
It's working now.

How do I uninstall/remove Spark 1.6 from my Mac?

Thanks


On Sun, Sep 25, 2016 at 4:28 AM, Jacek Laskowski  wrote:

> Hi,
>
> Can you execute run-example SparkPi with your Spark installation?
>
> Also, see the logs:
>
> 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
>
> 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI'
> on port 4041.
>
> You've got two Spark runtimes up that may or may not contribute to the
> issue.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Sep 25, 2016 at 8:36 AM, vr spark  wrote:
> > Hi,
> > I have this simple scala app which works fine when i run it as scala
> > application from the scala IDE for eclipse.
> > But when i export is as jar and run it from spark-submit i am getting
> below
> > error. Please suggest
> >
> > bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar
> >
> > 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port
> 4040.
> > Attempting port 4041.
> >
> > 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI' on
> port
> > 4041.
> >
> > 16/09/24 23:15:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> > http://192.168.1.3:4041
> >
> > 16/09/24 23:15:15 INFO SparkContext: Added JAR
> > file:/Users/vr/Downloads/spark-2.0.0/test.jar at
> > spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
> >
> > 16/09/24 23:15:15 INFO Executor: Starting executor ID driver on host
> > localhost
> >
> > 16/09/24 23:15:15 INFO Utils: Successfully started service
> > 'org.apache.spark.network.netty.NettyBlockTransferService' on port
> 59264.
> >
> > 16/09/24 23:15:15 INFO NettyBlockTransferService: Server created on
> > 192.168.1.3:59264
> >
> > 16/09/24 23:15:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0
> (TID
> > 0, localhost, partition 0, PROCESS_LOCAL, 5354 bytes)
> >
> > 16/09/24 23:15:16 INFO TaskSetManager: Starting task 1.0 in stage 0.0
> (TID
> > 1, localhost, partition 1, PROCESS_LOCAL, 5354 bytes)
> >
> > 16/09/24 23:15:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> >
> > 16/09/24 23:15:16 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> >
> > 16/09/24 23:15:16 INFO Executor: Fetching
> > spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
> >
> > 16/09/24 23:16:31 INFO Executor: Fetching
> > spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
> >
> > 16/09/24 23:16:31 ERROR Executor: Exception in task 1.0 in stage 0.0
> (TID 1)
> >
> > java.io.IOException: Failed to connect to /192.168.1.3:59263
> >
> > at
> > org.apache.spark.network.client.TransportClientFactory.createClient(
> TransportClientFactory.java:228)
> >
> > at
> > org.apache.spark.network.client.TransportClientFactory.createClient(
> TransportClientFactory.java:179)
> >
> > at
> > org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(
> NettyRpcEnv.scala:358)
> >
> > at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(
> NettyRpcEnv.scala:324)
> >
> > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:633)
> >
> > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:459)
> >
> > at
> > org.apache.spark.executor.Executor$$anonfun$org$apache$
> spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:488)
> >
> > at
> > org.apache.spark.executor.Executor$$anonfun$org$apache$
> spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:480)
> >
> > at
> > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
> TraversableLike.scala:733)
> >
> > at
> > scala.collection.mutable.HashMap$$anonfun$foreach$1.
> apply(HashMap.scala:99)
> >
> > at
> > scala.collection.mutable.HashMap$$anonfun$foreach$1.
> apply(HashMap.scala:99)
> >
> > at
> > scala.collection.mutable.HashTable$class.foreachEntry(
> HashTable.scala:230)
> >
> > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> >
> > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> >
> > at
> > scala.collection.TraversableLike$WithFilter.
> foreach(TraversableLike.scala:732)
> >
> > at
> > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$
> updateDependencies(Executor.scala:480)
> >
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:252)
> >
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> >
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> >
> > at java.lang.Thread.run(Thread.java:745)
> >
> >
> >
> >
> >
> > My Scala code
> >
> >
> > package com.x.y.vr.spark.first
> >
> > /* SimpleApp.scala */
> >
> > import 

Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Marco Mistroni
Hi
 in fact I have just found some written notes in my code; see if these
docs help you (this will work with any Spark version, not only 1.3.0):

https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#creating-dataframes
hth


On Sun, Sep 25, 2016 at 1:25 PM, Marco Mistroni  wrote:

> Hi
>
>  i must admit , i had issues as well in finding a  sample that does that,
> (hopefully Spark folks can add more examples or someone on the list can
> post a sample code?)
>
> hopefully you can reuse sample below
> So,  you start from an rdd of doubles (myRdd)
>
> ## make a row
> val toRddOfRows = myRdd.map(doubleValues => Row.fromSeq(doubleValues)
>
> # then you can either call toDF directly. spk will build a schema for
> you..beware you will need to import   import org.apache.spark.sql.
> SQLImplicits
>
> val df = toRddOfRows.toDF()
>
> # or you can create a schema  yourself
> def createSchema(row: Row) = {
> val first = row.toSeq
> val firstWithIdx = first.zipWithIndex
> val fields = firstWithIdx.map(tpl => StructField("Col" + tpl._2,
> DoubleType, false))
> StructType(fields)
>
>   }
>
> val mySchema =  createSchema(toRddOfRow.first())
>
> // returning DataFrame
> val mydf =   sqlContext.createDataFrame(toRddOfRow, schema)
>
>
> hth
>
>
>
>
>
> U need to define a schema to make a df out of your list... check spark
> docs on how to make a df or some machine learning examples
>
> On 25 Sep 2016 12:57 pm, "Dan Bikle"  wrote:
>
>> Hello World,
>>
>> I am familiar with Python and I am learning Spark-Scala.
>>
>> I want to build a DataFrame which has structure desribed by this syntax:
>>
>>
>>
>>
>>
>>
>>
>>
>> // Prepare training data from a list of (label, features) tuples.
>> val training = spark.createDataFrame(Seq(
>>   (1.1, Vectors.dense(1.1, 0.1)),
>>   (0.2, Vectors.dense(1.0, -1.0)),
>>   (3.0, Vectors.dense(1.3, 1.0)),
>>   (1.0, Vectors.dense(1.2, -0.5))
>> )).toDF("label", "features")
>> I got the above syntax from this URL:
>>
>> http://spark.apache.org/docs/latest/ml-pipeline.html
>>
>> Currently my data is in array which I had pulled out of a DF:
>>
>>
>> val my_a = gspc17_df.collect().map { row =>
>>   Seq(row(2), Vectors.dense(row(3).asInstanceOf[Double], row(4).asInstanceOf[Double]))
>> }
>> The structure of my array is very similar to the above DF:
>>
>>
>>
>>
>>
>>
>> my_a: Array[Seq[Any]] = Array(
>>   List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
>>   List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
>>   List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
>> How to copy data from my array into a DataFrame which has the above
>> structure?
>>
>> I tried this syntax:
>>
>>
>> val my_df = spark.createDataFrame(my_a).toDF("label","features")
>> Spark barked at me:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> :105: error: inferred type arguments [Seq[Any]] do not conform to
>> method createDataFrame's type parameter bounds [A <: Product]
>>        val my_df = spark.createDataFrame(my_a).toDF("label","features")
>>                          ^
>> :105: error: type mismatch;
>>  found   : scala.collection.mutable.WrappedArray[Seq[Any]]
>>  required: Seq[A]
>>        val my_df = spark.createDataFrame(my_a).toDF("label","features")
>>                          ^
>> scala>
>>
>


Re: In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Marco Mistroni
Hi

 I must admit I had issues as well in finding a sample that does that
(hopefully the Spark folks can add more examples, or someone on the list can
post some sample code?).

Hopefully you can reuse the sample below.
So, you start from an RDD of doubles (myRdd).

## make a row
val toRddOfRows = myRdd.map(doubleValues => Row.fromSeq(doubleValues))

# then you can either call toDF directly. Spark will build a schema for
you... beware you will need to import org.apache.spark.sql.SQLImplicits

val df = toRddOfRows.toDF()

# or you can create a schema yourself
def createSchema(row: Row) = {
  val first = row.toSeq
  val firstWithIdx = first.zipWithIndex
  val fields = firstWithIdx.map(tpl => StructField("Col" + tpl._2, DoubleType, false))
  StructType(fields)
}

val mySchema = createSchema(toRddOfRows.first())

// returning DataFrame
val mydf = sqlContext.createDataFrame(toRddOfRows, mySchema)
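
In your specific case the error comes from Seq[Any] not being a Product;
building an array of tuples instead should also work (a sketch, assuming
column 2 is the label and columns 3/4 the features, as in your snippet):

val my_a = gspc17_df.collect().map { row =>
  (row(2).asInstanceOf[Double],
   Vectors.dense(row(3).asInstanceOf[Double], row(4).asInstanceOf[Double]))
}
val my_df = spark.createDataFrame(my_a).toDF("label", "features")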


hth





You need to define a schema to make a df out of your list... check the Spark
docs on how to make a df, or some machine learning examples.

On 25 Sep 2016 12:57 pm, "Dan Bikle"  wrote:

> Hello World,
>
> I am familiar with Python and I am learning Spark-Scala.
>
> I want to build a DataFrame which has structure desribed by this syntax:
>
>
>
>
>
>
>
>
> // Prepare training data from a list of (label, features) tuples.
> val training = spark.createDataFrame(Seq(
>   (1.1, Vectors.dense(1.1, 0.1)),
>   (0.2, Vectors.dense(1.0, -1.0)),
>   (3.0, Vectors.dense(1.3, 1.0)),
>   (1.0, Vectors.dense(1.2, -0.5))
> )).toDF("label", "features")
> I got the above syntax from this URL:
>
> http://spark.apache.org/docs/latest/ml-pipeline.html
>
> Currently my data is in array which I had pulled out of a DF:
>
>
> val my_a = gspc17_df.collect().map { row =>
>   Seq(row(2), Vectors.dense(row(3).asInstanceOf[Double], row(4).asInstanceOf[Double]))
> }
> The structure of my array is very similar to the above DF:
>
>
>
>
>
>
> my_a: Array[Seq[Any]] = Array(
>   List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
>   List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
>   List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
> How to copy data from my array into a DataFrame which has the above
> structure?
>
> I tried this syntax:
>
>
> val my_df = spark.createDataFrame(my_a).toDF("label","features")
> Spark barked at me:
>
>
>
>
>
>
>
>
>
>
> :105: error: inferred type arguments [Seq[Any]] do not conform to
> method createDataFrame's type parameter bounds [A <: Product]
>        val my_df = spark.createDataFrame(my_a).toDF("label","features")
>                          ^
> :105: error: type mismatch;
>  found   : scala.collection.mutable.WrappedArray[Seq[Any]]
>  required: Seq[A]
>        val my_df = spark.createDataFrame(my_a).toDF("label","features")
>                          ^
> scala>
>


In Spark-Scala, how to copy Array of Lists into new DataFrame?

2016-09-25 Thread Dan Bikle
Hello World,

I am familiar with Python and I am learning Spark-Scala.

I want to build a DataFrame whose structure is described by this syntax:

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.1, Vectors.dense(1.1, 0.1)),
  (0.2, Vectors.dense(1.0, -1.0)),
  (3.0, Vectors.dense(1.3, 1.0)),
  (1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:

http://spark.apache.org/docs/latest/ml-pipeline.html

Currently my data is in an array which I had pulled out of a DF:

val my_a = gspc17_df.collect().map { row =>
  Seq(row(2), Vectors.dense(row(3).asInstanceOf[Double], row(4).asInstanceOf[Double]))
}
The structure of my array is very similar to the above DF:

my_a: Array[Seq[Any]] = Array(
  List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
  List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
  List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How to copy data from my array into a DataFrame which has the above
structure?

I tried this syntax:

val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:

:105: error: inferred type arguments [Seq[Any]] do not conform to
method createDataFrame's type parameter bounds [A <: Product]
       val my_df = spark.createDataFrame(my_a).toDF("label","features")
                         ^
:105: error: type mismatch;
 found   : scala.collection.mutable.WrappedArray[Seq[Any]]
 required: Seq[A]
       val my_df = spark.createDataFrame(my_a).toDF("label","features")
                         ^
scala>


Re: spark-submit failing but job running from scala ide

2016-09-25 Thread Jacek Laskowski
Hi,

Can you execute run-example SparkPi with your Spark installation?

Also, see the logs:

16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.

16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI'
on port 4041.

You've got two Spark runtimes up that may or may not contribute to the issue.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Sep 25, 2016 at 8:36 AM, vr spark  wrote:
> Hi,
> I have this simple scala app which works fine when i run it as scala
> application from the scala IDE for eclipse.
> But when i export is as jar and run it from spark-submit i am getting below
> error. Please suggest
>
> bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar
>
> 16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port 4040.
> Attempting port 4041.
>
> 16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI' on port
> 4041.
>
> 16/09/24 23:15:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://192.168.1.3:4041
>
> 16/09/24 23:15:15 INFO SparkContext: Added JAR
> file:/Users/vr/Downloads/spark-2.0.0/test.jar at
> spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
>
> 16/09/24 23:15:15 INFO Executor: Starting executor ID driver on host
> localhost
>
> 16/09/24 23:15:15 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59264.
>
> 16/09/24 23:15:15 INFO NettyBlockTransferService: Server created on
> 192.168.1.3:59264
>
> 16/09/24 23:15:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, localhost, partition 0, PROCESS_LOCAL, 5354 bytes)
>
> 16/09/24 23:15:16 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 1, localhost, partition 1, PROCESS_LOCAL, 5354 bytes)
>
> 16/09/24 23:15:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
>
> 16/09/24 23:15:16 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
>
> 16/09/24 23:15:16 INFO Executor: Fetching
> spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
>
> 16/09/24 23:16:31 INFO Executor: Fetching
> spark://192.168.1.3:59263/jars/test.jar with timestamp 1474784115210
>
> 16/09/24 23:16:31 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
>
> java.io.IOException: Failed to connect to /192.168.1.3:59263
>
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>
> at
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:358)
>
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
>
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:633)
>
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:459)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:488)
>
> at
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:480)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>
> at
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:480)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:252)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> My Scala code
>
>
> package com.x.y.vr.spark.first
>
> /* SimpleApp.scala */
>
> import org.apache.spark.SparkContext
>
> import org.apache.spark.SparkContext._
>
> import org.apache.spark.SparkConf
>
> object SimpleApp {
>
>   def main(args: Array[String]) {
>
> val logFile = "/Users/vttrich/Downloads/spark-2.0.0/README.md" // Should
> be some file on your system
>
> val conf = new SparkConf().setAppName("Simple Application")
>
> val sc = new SparkContext("local[*]", "RatingsCounter")
>
> //val sc = new SparkContext(conf)
>
> val logData = sc.textFile(logFile, 2).cache()
>
> val numAs = logData.filter(line => line.contains("a")).count()
>
> val numBs = 

Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Jörn Franke
Use a tool like Flume and/or Oozie to reliably download files over HTTP and do
error handling (e.g. requests timing out). Afterwards, process the data with Spark.

> On 25 Sep 2016, at 10:27, Dan Bikle  wrote:
> 
> hello spark-world,
> 
> How to use Spark-Scala to download a CSV file from the web and load the file 
> into a spark-csv DataFrame?
> 
> Currently I depend on curl in a shell command to get my CSV file.
> 
> Here is the syntax I want to enhance:
> 
> /* fb_csv.scala
> This script should load FB prices from Yahoo.
> 
> Demo:
> spark-shell -i fb_csv.scala
> */
> 
> // I should get prices:
> import sys.process._
> "/usr/bin/curl -o /tmp/fb.csv http://ichart.finance.yahoo.com/table.csv?s=FB;!
> 
> import org.apache.spark.sql.SQLContext
> 
> val sqlContext = new SQLContext(sc)
> 
> val fb_df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")
> 
> fb_df.head(9)
> 
> I want to enhance the above script so it is pure Scala with no shell syntax 
> inside.
> 


Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Marco Mistroni
Hi
 not sure if spark-csv supports the http:// scheme you are using to load
data from the web. I just tried this and got an exception:

scala> val df = sqlContext.read.
     | format("com.databricks.spark.csv").
     | option("inferSchema", "true").
     | load("http://ichart.finance.yahoo.com/table.csv?s=FB")
16/09/25 10:08:09 WARN : Your hostname, MarcoLaptop resolves to a
loopback/non-reachable address: fe80:0:0:0:3c1f:e7b4:c7cc:d2bd%wlan3, but
we couldn't find any external IP address!
java.io.IOException: No FileSystem for scheme: http


But it supports reading from a local csv file, so you could write a Spark
program that:
1. downloads your FB data from Yahoo (I have code which does exactly what
you are doing, using the com.github.tototoshi.csv package to download csv
data from the web)
2. creates an RDD (or a DataFrame) out of it
3. does whatever processing you need
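
A rough sketch of steps 1 and 2 in plain Scala, using only the JDK for the
download (no shell syntax); the URL and the temp path are just the ones from
your own example:

import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

val url = new URL("http://ichart.finance.yahoo.com/table.csv?s=FB")
val tmp = Paths.get("/tmp/fb.csv")
val in  = url.openStream()
try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING) finally in.close()

val fb_df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(tmp.toString)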

hth


How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Dan Bikle
hello spark-world,

How to use Spark-Scala to download a CSV file from the web and load the
file into a spark-csv DataFrame?

Currently I depend on curl in a shell command to get my CSV file.

Here is the syntax I want to enhance:



















*/* fb_csv.scalaThis script should load FB prices from
Yahoo.Demo:spark-shell -i fb_csv.scala*/// I should get prices:import
sys.process._"/usr/bin/curl -o /tmp/fb.csv
http://ichart.finance.yahoo.com/table.csv?s=FB
"!import
org.apache.spark.sql.SQLContextval sqlContext = new SQLContext(sc)val fb_df
=
sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")fb_df.head(9)*
I want to enhance the above script so it is pure Scala with no shell syntax
inside.


Re: Open source Spark based projects

2016-09-25 Thread Simon Chan
PredictionIO is an open-source machine learning server project based on
Spark - http://predictionio.incubator.apache.org/


Simon

On Fri, Sep 23, 2016 at 12:46 PM, manasdebashiskar 
wrote:

> check out spark packages https://spark-packages.org/ and you will find few
> awesome and a lot of super awesome projects.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778p27788.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Left Join Yields Results And Not Results

2016-09-25 Thread Aaron Jackson
Hi,

I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very
odd situation. I have two dataframes, base and updated. The "updated"
dataframe contains a constrained subset of data from "base" that I wish to
exclude. Something like this:

updated = base.where(base.X == F.lit(1000))

It's more complicated than that, but you get the idea.

Later, I do a left join.

base.join(updated, 'Core_Column', 'left_outer')

This should return all values in base and null where updated doesn't have
an equality match.  And that's almost true, but here's where it gets
strange.

base.join(updated, 'Core_Column', 'left_outer').select(base.FieldId,
updated.FieldId, 'updated.*').show()

|FieldId|FieldId|FieldId|x|y|z
|123|123|null|1|2|3

Now I understand why base.FieldId shows 123, but why does updated.FieldId
show 123 as well, when the expanded join for 'updated.*' shows null? I can
do what I want by using an RDD, but I was hoping to avoid bypassing
Tungsten.

It almost feels like it's optimizing the field based on the join.  But I
tested other fields as well and they also came back with values from base.
Very odd.

Any thoughts?

Aaron


spark-submit failing but job running from scala ide

2016-09-25 Thread vr spark
Hi,
I have this simple Scala app which works fine when I run it as a Scala
application from the Scala IDE for Eclipse.
But when I export it as a jar and run it with spark-submit I get the error
below. Please suggest.

*bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar*

16/09/24 23:15:15 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.

16/09/24 23:15:15 INFO Utils: Successfully started service 'SparkUI' on
port 4041.

16/09/24 23:15:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://192.168.1.3:4041

16/09/24 23:15:15 INFO SparkContext: Added JAR
file:/Users/vr/Downloads/spark-2.0.0/test.jar at spark://
192.168.1.3:59263/jars/test.jar with timestamp 1474784115210

16/09/24 23:15:15 INFO Executor: Starting executor ID driver on host
localhost

16/09/24 23:15:15 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 59264.

16/09/24 23:15:15 INFO NettyBlockTransferService: Server created on
192.168.1.3:59264

16/09/24 23:15:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, partition 0, PROCESS_LOCAL, 5354 bytes)

16/09/24 23:15:16 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, partition 1, PROCESS_LOCAL, 5354 bytes)

16/09/24 23:15:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

16/09/24 23:15:16 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)

16/09/24 23:15:16 INFO Executor: Fetching spark://
192.168.1.3:59263/jars/test.jar with timestamp 1474784115210

16/09/24 23:16:31 INFO Executor: Fetching spark://
192.168.1.3:59263/jars/test.jar with timestamp 1474784115210

16/09/24 23:16:31 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.io.IOException: Failed to connect to /192.168.1.3:59263

at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)

at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)

at
org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:358)

at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)

at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:633)

at org.apache.spark.util.Utils$.fetchFile(Utils.scala:459)

at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:488)

at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:480)

at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)

at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)

at org.apache.spark.executor.Executor.org
$apache$spark$executor$Executor$$updateDependencies(Executor.scala:480)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:252)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)




*My Scala code*


package com.x.y.vr.spark.first

/* SimpleApp.scala */

import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

import org.apache.spark.SparkConf

object SimpleApp {

  def main(args: Array[String]) {

val logFile = "/Users/vttrich/Downloads/spark-2.0.0/README.md" //
Should be some file on your system

val conf = new SparkConf().setAppName("Simple Application")

val sc = new SparkContext("local[*]", "RatingsCounter")

//val sc = new SparkContext(conf)

val logData = sc.textFile(logFile, 2).cache()

val numAs = logData.filter(line => line.contains("a")).count()

val numBs = logData.filter(line => line.contains("b")).count()

println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

  }

}