Re: Streaming K-means not printing predictions

2016-04-27 Thread Ashutosh Kumar
The problem seems to be that textFileStream(path) on the StreamingContext is
not reading the file at all. It does not throw any exception either. I tried
some tricks suggested on the mailing lists, like copying the file into the
specified directory after the program starts and touching the file to change
its timestamp, but no luck.

Thanks
Ashutosh
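
For reference, a minimal sketch of the usual textFileStream setup (the paths, app name and batch interval are illustrative, not taken from the thread); only files moved into the monitored directory after the context starts are picked up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    // local[*] (or local[n] with n >= 2) leaves threads free for processing
    val conf = new SparkConf().setMaster("local[*]").setAppName("FileStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Files must appear in this directory after the context starts,
    // ideally moved/renamed into it atomically rather than copied in place.
    val lines = ssc.textFileStream("hdfs:///tmp/streaming-input")
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}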


On Wed, Apr 27, 2016 at 2:43 PM, Niki Pavlopoulou  wrote:

> One of the reasons this happened to me (assuming everything is OK in your
> streaming process) is running it in plain local mode; instead of local, use
> local[*] or local[4].
>
> On 26 April 2016 at 15:10, Ashutosh Kumar 
> wrote:
>
>> I created a Streaming k means based on scala example. It keeps running
>> without any error but never prints predictions
>>
>> Here is Log
>>
>> 19:15:05,050 INFO
>> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
>> batch metadata: 146167824 ms
>> 19:15:10,001 INFO
>> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
>> files took 1 ms
>> 19:15:10,001 INFO
>> org.apache.spark.streaming.dstream.FileInputDStream   - New files
>> at time 146167831 ms:
>>
>> 19:15:10,007 INFO
>> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
>> files took 2 ms
>> 19:15:10,007 INFO
>> org.apache.spark.streaming.dstream.FileInputDStream   - New files
>> at time 146167831 ms:
>>
>> 19:15:10,014 INFO
>> org.apache.spark.streaming.scheduler.JobScheduler - Added jobs
>> for time 146167831 ms
>> 19:15:10,015 INFO
>> org.apache.spark.streaming.scheduler.JobScheduler - Starting
>> job streaming job 146167831 ms.0 from job set of time 146167831 ms
>> 19:15:10,028 INFO
>> org.apache.spark.SparkContext - Starting
>> job: collect at StreamingKMeans.scala:89
>> 19:15:10,028 INFO
>> org.apache.spark.scheduler.DAGScheduler   - Job 292
>> finished: collect at StreamingKMeans.scala:89, took 0.41 s
>> 19:15:10,029 INFO
>> org.apache.spark.streaming.scheduler.JobScheduler - Finished
>> job streaming job 146167831 ms.0 from job set of time 146167831 ms
>> 19:15:10,029 INFO
>> org.apache.spark.streaming.scheduler.JobScheduler - Starting
>> job streaming job 146167831 ms.1 from job set of time 146167831 ms
>> ---
>> Time: 146167831 ms
>> ---
>>
>> 19:15:10,036 INFO
>> org.apache.spark.streaming.scheduler.JobScheduler - Finished
>> job streaming job 146167831 ms.1 from job set of time 146167831 ms
>> 19:15:10,036 INFO
>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>> RDD 2912 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>> RDD 2911 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.storage.BlockManager - Removing
>> RDD 2912
>> 19:15:10,037 INFO
>> org.apache.spark.streaming.scheduler.JobScheduler - Total
>> delay: 0.036 s for time 146167831 ms (execution: 0.021 s)
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.UnionRDD - Removing
>> RDD 2800 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.storage.BlockManager - Removing
>> RDD 2911
>> 19:15:10,037 INFO
>> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
>> old files that were older than 146167825 ms: 1461678245000 ms
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>> RDD 2917 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.storage.BlockManager - Removing
>> RDD 2800
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>> RDD 2916 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>> RDD 2915 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.MapPartitionsRDD - Removing
>> RDD 2914 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.rdd.UnionRDD - Removing
>> RDD 2803 from persistence list
>> 19:15:10,037 INFO
>> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
>> old files that were older than 146167825 ms: 1461678245000 ms
>> 19:15:10,038 INFO
>> org.apache.spark.streaming.scheduler.ReceivedBlockTracker - Deleting
>> batches ArrayBuffer()
>> 19:15:10,038 INFO
>> org.apache.spark.storage.BlockManager - Removing
>> RDD 2917
>> 19:15:10,038 INFO
>> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
>> batch metadata: 1461678245000 ms
>> 19:15:10,038 INFO
>> org.apache.spark.storage.BlockManager - 

Re: Cant join same dataframe twice ?

2016-04-27 Thread Ted Yu
I wonder if Spark can provide better support for this case.

The following schema is not user friendly (shown previously):

StructField(b,IntegerType,false), StructField(b,IntegerType,false)

Except for 'select *', there is no way for the user to query either of the two
fields.

On Tue, Apr 26, 2016 at 10:17 PM, Takeshi Yamamuro 
wrote:

> Based on my example, how about renaming columns?
>
> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"),
> df2("b").as("2-b"))
> val df4 = df3.join(df2, df3("2-b") === df2("b"))
>
> // maropu
>
> On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot 
> wrote:
>
>> Correct Takeshi
>> Even I am facing the same issue .
>>
>> How to avoid the ambiguity ?
>>
>>
>> On 27 April 2016 at 11:54, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> I tried;
>>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>> val df3 = df1.join(df2, "a")
>>> val df4 = df3.join(df2, "b")
>>>
>>> And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
>>> ambiguous, could be: b#6, b#14.;
>>> If it is the same case, this message makes sense and is clear.
>>>
>>> Thoughts?
>>>
>>> // maropu
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla 
>>> wrote:
>>>
 Also, check the column names of df1 ( after joining df2 and df3 ).

 Prasad.

 From: Ted Yu
 Date: Monday, April 25, 2016 at 8:35 PM
 To: Divya Gehlot
 Cc: "user @spark"
 Subject: Re: Cant join same dataframe twice ?

 Can you show us the structure of df2 and df3 ?

 Thanks

 On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
 wrote:

> Hi,
> I am using Spark 1.5.2 .
> I have a use case where I need to join the same dataframe twice on two
> different columns.
> I am getting a missing-columns error.
>
> For instance,
> val df1 = df2.join(df3,"Column1")
> The line below throws the missing-columns error:
> val df4 = df1.join(df3,"Column2")
>
> Is this a bug or a valid scenario?
>
>
>
>
> Thanks,
> Divya
>


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
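
For reference, another way to work around the ambiguity is to alias each side of the join instead of renaming columns; a rough sketch in the spark-shell, reusing the toy data from the thread (the alias names are arbitrary):

val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// alias each side so "b" can be referenced unambiguously
val df3 = df1.as("l").join(df2.as("r"), $"l.a" === $"r.a")
  .select($"l.a".as("a"), $"l.b".as("l_b"), $"r.b".as("r_b"))

val df4 = df3.join(df2.as("r2"), df3("r_b") === $"r2.b")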


What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-27 Thread SRK
Hi,

We seem to be getting a lot of LeaderLostExceptions, and our source stream is
working with the default value of rebalance.backoff.ms, which is 2000. I was
thinking of increasing this value to 5000. Any suggestions on this?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-default-value-of-rebalance-backoff-ms-in-Spark-Kafka-Direct-tp26840.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



unsubscribe

2016-04-27 Thread Varanasi, Venkata








Transformation question

2016-04-27 Thread Eduardo
Is there a way to write a transformation that for each entry of an RDD uses
certain other values of another RDD? As an example, imagine you have an RDD of
entries to predict a certain label. In a second RDD, you have historical
data. So for each entry in the first RDD, you want to find similar entries
in the second RDD and take, let's say, the average. Does that fit the Spark
model? Is there any alternative?

Thanks in advance


Re: getting ClassCastException when calling UDF

2016-04-27 Thread Paras sachdeva
Hi

Can you try the below?

We register the function using org.apache.spark.sql.functions.udf:

def myUDF(wgts: Int, amnt: Float) = (wgts * amnt) / 100.asInstanceOf[Float]

val myUdf = udf(myUDF(_: Int, _: Float))

Now you can invoke the function directly in Spark SQL or outside it.

Thanks,
Paras Sachdeva

On Wed, Apr 27, 2016 at 1:18 PM, Divya Gehlot 
wrote:

> Hi,
> I am using Spark 1.5.2 and defined the below UDF:
>
> import org.apache.spark.sql.functions.udf
>> val myUdf  = (wgts : Int , amnt :Float) => {
>> (wgts*amnt)/100.asInstanceOf[Float]
>> }
>>
>
>
>
> val df2 = df1.withColumn("WEIGHTED_AMOUNT",callUDF(udfcalWghts,
> FloatType,col("RATE"),col("AMOUNT")))
>
> In my schema, RATE is IntegerType and AMOUNT is FloatType.
>
> I am getting the below error:
>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 106 in stage 89.0 failed 4 times, most recent failure: Lost task 106.3 in
>> stage 89.0 (TID 7735, ip-xx-xx-xx-xxx.ap-southeast-1.compute.internal):
>> java.lang.ClassCastException: java.lang.Double cannot be cast to
>> java.lang.Float
>> at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)
>> at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(
>
>
>
>
> A similar issue is logged against the JavaScript API for Apache Spark (EclairJS):
> https://github.com/EclairJS/eclairjs-nashorn/issues/3
> 
>
> Can somebody help me with the resolution ?
>
>
>
>
>
>
>
>
>
>
>
> Thanks,
> Divya
>
>
>
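
For reference, a sketch of a variant that avoids the Float unboxing problem by working in Double and casting the column explicitly (column names follow the thread; the rest is illustrative and untested against the original schema):

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.DoubleType

// Take Double instead of Float, so a Double-typed column value can be unboxed.
val weightedAmount = udf { (wgts: Int, amnt: Double) => (wgts * amnt) / 100.0 }

val df2 = df1.withColumn(
  "WEIGHTED_AMOUNT",
  weightedAmount(col("RATE"), col("AMOUNT").cast(DoubleType)))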


Re: Save DataFrame to HBase

2016-04-27 Thread Paras sachdeva
Hi Daniel,

Would you possibly be able to share the snippet of code you have used?

Thank you.

On Wed, Apr 27, 2016 at 3:13 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi Benjamin,
> Yes it should work.
>
> Let me know if you need further assistance I might be able to get the code
> I've used for that project.
>
> Thank you.
> Daniel
>
> On 24 Apr 2016, at 17:35, Benjamin Kim  wrote:
>
> Hi Daniel,
>
> How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed
> which comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
>
> Thanks,
> Ben
>
> On Apr 24, 2016, at 1:43 AM, Daniel Haviv 
> wrote:
>
> Hi,
> I tried saving DF to HBase using a hive table with hbase storage handler
> and hiveContext but it failed due to a bug.
>
> I was able to persist the DF to hbase using Apache Pheonix which was
> pretty simple.
>
> Thank you.
> Daniel
>
> On 21 Apr 2016, at 16:52, Benjamin Kim  wrote:
>
> Has anyone found an easy way to save a DataFrame into HBase?
>
> Thanks,
> Ben
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
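
For reference, a minimal sketch of the Phoenix save path Daniel describes (table name and ZooKeeper quorum are placeholders; it assumes the phoenix-spark artifact matching your Phoenix version is on the classpath and that the Phoenix table already exists):

import org.apache.spark.sql.SaveMode

df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zk-host:2181")
  .save()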


Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-27 Thread Cody Koeninger
Seems like it'd be better to look into the Kafka side of things to
determine why you're losing leaders frequently, as opposed to trying
to put a bandaid on it.

On Wed, Apr 27, 2016 at 11:49 AM, SRK  wrote:
> Hi,
>
> We seem to be getting a lot of LeaderLostExceptions and our source Stream is
> working with a default value of rebalance.backoff.ms which is 2000. I was
> thinking to increase this value to 5000. Any suggestions on  this?
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-default-value-of-rebalance-backoff-ms-in-Spark-Kafka-Direct-tp26840.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
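
For reference, Kafka consumer properties for the direct stream go into the kafkaParams map, so a value like the one discussed in the thread can be supplied there; a sketch with placeholder brokers and topic (as noted above, fixing the leader churn itself is probably the better route):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "rebalance.backoff.ms" -> "5000",
  "refresh.leader.backoff.ms" -> "5000")  // backoff before re-fetching a lost leader

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))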



Re: How can I bucketize / group a DataFrame from parquet files?

2016-04-27 Thread Michael Armbrust
Unfortunately, I don't think there is an easy way to do this in 1.6.  In
Spark 2.0 we will make DataFrame = Dataset[Row], so this should work out of
the box.

On Mon, Apr 25, 2016 at 11:08 PM, Brandon White 
wrote:

> I am creating a dataFrame from parquet files. The schema is based on the
> parquet files; I do not know it beforehand. What I want to do is group the
> entire DF into buckets based on a column.
>
> val df = sqlContext.read.parquet("/path/to/files")
> val groupedBuckets: DataFrame[String, Array[Rows]] =
> df.groupBy($"columnName")
>
> I know this does not work because the DataFrame's groupBy is only used for
> aggregate functions. I cannot convert my DataFrame to a DataSet because I
> do not have a case class for the DataSet schema. The only thing I can do is
> convert the df to an RDD[Rows] and try to deal with the types. This is ugly
> and difficult.
>
> Is there any better way? Can I convert a DataFrame to a DataSet without a
> predefined case class?
>
> Brandon
>
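
To illustrate what the Spark 2.0 behaviour mentioned above looks like, a rough sketch (untested; the column name and the per-bucket computation are placeholders, and spark is the 2.0 SparkSession):

import spark.implicits._

val df = spark.read.parquet("/path/to/files")

// DataFrame = Dataset[Row] in 2.0, so groupByKey works directly on it
val grouped = df.groupByKey(_.getAs[String]("columnName"))

// any per-bucket computation can then go through mapGroups/reduceGroups, e.g.
val bucketSizes = grouped.mapGroups((key, rows) => (key, rows.size))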


Re: Transformation question

2016-04-27 Thread Mathieu Longtin
I would make a DataFrame (or DataSet) out of the RDD and use SQL join.

On Wed, Apr 27, 2016 at 2:50 PM Eduardo  wrote:

> Is there a way to write a transformation that for each entry of an RDD
> uses certain other values of another RDD? As an example, image you have a
> RDD of entries to predict a certain label. In a second RDD, you have
> historical data. So for each entry in the first RDD, you want to find
> similar entries in the second RDD and take, let's say, the average. Does
> that fit the Spark model? Is there any alternative?
>
> Thanks in advance
>
-- 
Mathieu Longtin
1-514-803-8977
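
A rough sketch of that approach for the example in the question (all names are placeholders, and the shared "bucket" column stands in for whatever notion of similarity applies; real similarity logic would need its own key, or a cross join plus a filter):

import org.apache.spark.sql.functions.avg
import sqlContext.implicits._

// assume entriesRdd: RDD[(Long, String)] and historyRdd: RDD[(String, Double)]
val toPredict  = entriesRdd.toDF("id", "bucket")
val historical = historyRdd.toDF("bucket", "label")

// for each entry, average the labels of historical records in the same bucket
val predictions = toPredict
  .join(historical, "bucket")
  .groupBy("id")
  .agg(avg("label").as("predicted"))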


Error running spark-sql-perf version 0.3.2 against Spark 1.6

2016-04-27 Thread Michael Slavitch
Hello;

I'm trying to run spark-sql-perf version 0.3.2 (hash cb0347b) against Spark
1.6. I get the following when running:

 ./bin/run --benchmark DatsetPerformance 

Exception in thread "main" java.lang.ClassNotFoundException: 
com.databricks.spark.sql.perf.DatsetPerformance

Even though the classes are built.  How do I resolve this?


Another question: why does the current version of the benchmark run only
against Spark 2.0?




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Dataframe fails for large resultsize

2016-04-27 Thread Buntu Dev
I have 14GB of parquet data, and when trying to apply an ORDER BY using Spark
SQL and save the first 1M rows, it keeps failing with "Connection reset by
peer: socket write error" on the executors.

I've allocated about 10g to both the driver and the executors, along with
setting maxResultSize to 10g, but it still fails with the same error. I'm
using Spark 1.5.1.

Are there any other alternative ways to handle this?

Thanks!
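
For what it's worth, a sketch of one alternative that keeps the result off the driver entirely by writing it out from the executors instead of collecting it (column and paths are placeholders, and the global sort is still the expensive part):

val df = sqlContext.read.parquet("/path/to/input")

df.sort($"someColumn")
  .limit(1000000)
  .write
  .mode("overwrite")
  .parquet("/path/to/output")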


Re: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Esa Heikkinen


Hi

Thanks for the answer.

I have developed a log file analyzer for an RTPIS (Real Time Passenger 
Information System), where buses run on lines and the system tries 
to estimate the arrival times at the bus stops. There are many different 
log files (and events) and the analysis situation can be very complex. Spatial 
data can also be included in the log data.


The analyzer also has a query (or analysis) language, which describes an 
expected behavior. This can be a requirement of the system. The analyzer can 
also be thought of as a test oracle.


I have published a paper (SPLST'15 conference) about my analyzer and its 
language. The paper is maybe too technical, but it can be found here:

http://ceur-ws.org/Vol-1525/paper-19.pdf

I do not know yet where it belongs. I think it can be considered some kind of 
"CEP with delays". Or do you know better?
My analyzer can also do somewhat more complex and time-consuming 
analyses; there is no need for real time.


And is it possible to do "CEP with delays" reasonably with some existing 
analyzer (for example Spark)?


Regards
PhD student at Tampere University of Technology, Finland, www.tut.fi 


Esa Heikkinen

27.4.2016, 15:51, Michael Segel kirjoitti:

Spark and CEP? It depends…

Ok, I know that’s not the answer you want to hear, but it’s a bit more 
complicated…


If you consider Spark Streaming, you have some issues.
Spark Streaming isn’t a Real Time solution because it is a micro batch 
solution. The smallest Window is 500ms.  This means that if your 
compute time is >> 500ms and/or  your event flow is >> 500ms this 
could work.
(e.g. 'real time' image processing on a system that is capturing 60FPS 
because the processing time is >> 500ms. )


So Spark Streaming wouldn’t be the best solution….

However, you can combine spark with other technologies like Storm, 
Akka, etc .. where you have continuous streaming.

So you could instantiate a spark context per worker in storm…

I think if there are no class collisions between Akka and Spark, you 
could use Akka, which may have a better potential for communication 
between workers.

So here you can handle CEP events.

HTH

On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh 
> wrote:


please see my other reply

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com 




On 27 April 2016 at 10:40, Esa Heikkinen 
> 
wrote:


Hi

I have followed with interest the discussion about CEP and Spark.
It is quite close to my research, which is a complex analyzing
for log files and "history" data  (not actually for real time
streams).

I have few questions:

1) Is CEP only for (real time) stream data and not for "history"
data?

2) Is it possible to search "backward" (upstream) by CEP with
given time window? If a start time of the time window is earlier
than the current stream time.

3) Do you know any good tools or softwares for "CEP's" using for
log data ?

4) Do you know any good (scientific) papers i should read about CEP ?


Regards
PhD student at Tampere University of Technology, Finland,
www.tut.fi 
Esa Heikkinen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org





The opinions expressed here are mine, while they may reflect a 
cognitive thought, that is purely accidental.

Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com 









Re: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Mich Talebzadeh
couple of things.

There is no such thing as Continuous Data Streaming as there is no such
thing as Continuous Availability.

There is such a thing as Discrete Data Streaming and High Availability, but
they reduce the finite unavailability to a minimum. In terms of business
needs, 5 SIGMA is good enough and acceptable. Even candles set to a
predefined time interval, say 2, 4 or 15 seconds, overlap. No FX-savvy trader
makes a sell or buy decision on the basis of a 2-second candlestick.

The calculation itself in measurements is subject to finite error as
defined by its Confidence Level (CL) using the Standard Deviation function.

OK, so far I have never noticed a tool that requires that level of
granularity. That stuff from Flink etc. is, in practical terms, of little
value and does not make commercial sense.

Now with regard to your needs, Spark micro batching is perfectly adequate.

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 April 2016 at 22:10, Esa Heikkinen 
wrote:

>
> Hi
>
> Thanks for the answer.
>
> I have developed a log file analyzer for RTPIS (Real Time Passenger
> Information System) system, where buses drive lines and the system try to
> estimate the arrival times to the bus stops. There are many different log
> files (and events) and analyzing situation can be very complex. Also
> spatial data can be included to the log data.
>
> The analyzer also has a query (or analyzing) language, which describes a
> expected behavior. This can be a requirement of system. Analyzer can be
> think to be also a test oracle.
>
> I have published a paper (SPLST'15 conference) about my analyzer and its
> language. The paper is maybe too technical, but it is found:
> http://ceur-ws.org/Vol-1525/paper-19.pdf
>
> I do not know yet where it belongs. I think it can be some "CEP with
> delays". Or do you know better ?
> My analyzer can also do little bit more complex and time-consuming
> analyzings? There are no a need for real time.
>
> And it is possible to do "CEP with delays" reasonably some existing
> analyzer (for example Spark) ?
>
> Regards
> PhD student at Tampere University of Technology, Finland, www.tut.fi
> Esa Heikkinen
>
> 27.4.2016, 15:51, Michael Segel kirjoitti:
>
> Spark and CEP? It depends…
>
> Ok, I know that’s not the answer you want to hear, but its a bit more
> complicated…
>
> If you consider Spark Streaming, you have some issues.
> Spark Streaming isn’t a Real Time solution because it is a micro batch
> solution. The smallest Window is 500ms.  This means that if your compute
> time is >> 500ms and/or  your event flow is >> 500ms this could work.
> (e.g. 'real time' image processing on a system that is capturing 60FPS
> because the processing time is >> 500ms. )
>
> So Spark Streaming wouldn’t be the best solution….
>
> However, you can combine spark with other technologies like Storm, Akka,
> etc .. where you have continuous streaming.
> So you could instantiate a spark context per worker in storm…
>
> I think if there are no class collisions between Akka and Spark, you could
> use Akka, which may have a better potential for communication between
> workers.
> So here you can handle CEP events.
>
> HTH
>
> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh 
> wrote:
>
> please see my other reply
>
> Dr Mich Talebzadeh
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 27 April 2016 at 10:40, Esa Heikkinen 
> wrote:
>
>> Hi
>>
>> I have followed with interest the discussion about CEP and Spark. It is
>> quite close to my research, which is a complex analyzing for log files and
>> "history" data  (not actually for real time streams).
>>
>> I have few questions:
>>
>> 1) Is CEP only for (real time) stream data and not for "history" data?
>>
>> 2) Is it possible to search "backward" (upstream) by CEP with given time
>> window? If a start time of the time window is earlier than the current
>> stream time.
>>
>> 3) Do you know any good tools or softwares for "CEP's" using for log data
>> ?
>>
>> 4) Do you know any good (scientific) papers i should read about CEP ?
>>
>>
>> Regards
>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>> Esa Heikkinen
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
> The opinions expressed here are mine, while they may reflect a cognitive
> 

Re: EOFException while reading from HDFS

2016-04-27 Thread Bibudh Lahiri
Hi,
  I installed Hadoop 2.6.0 today on one of the machines (172.26.49.156),
got HDFS running on it (both Namenode and Datanode on the same machine) and
copied the files to HDFS. However, from the same machine, when I try to
load the same CSV with the following statement:

  sqlContext.read.format("com.databricks.spark.csv").option("header",
"false").load("hdfs://
172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv")

 I get the error

java.net.ConnectException: Call From impetus-i0276.impetus.co.in/127.0.0.1
to impetus-i0276:54310 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused

  I have changed the port number to 8020 but the same error gets reported.

  Even the following command is not working from the command line, when
launched from the HADOOP_HOME folder:

  bin/hdfs dfs -ls hdfs://172.26.49.156:54310/

  which was working earlier when issued from the other machine
(172.26.49.55), from under HADOOP_HOME for Hadoop 1.0.4.

  My ~/.bashrc settings are as follows, from when I installed Hadoop 2.6.0:

  export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export SPARK_HOME=/home/impadmin/spark-1.6.0-bin-hadoop2.6
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_PREFIX/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

  Am I getting the port number wrong, or is it some other config param that
I should check? What's the general rule here?

Thanks
  Bibudh

On Tue, Apr 26, 2016 at 7:51 PM, Davies Liu  wrote:

> The Spark package you are using is packaged with Hadoop 2.6, but the
> HDFS is Hadoop 1.0.4, they are not compatible.
>
> On Tue, Apr 26, 2016 at 11:18 AM, Bibudh Lahiri 
> wrote:
> > Hi,
> >   I am trying to load a CSV file which is on HDFS. I have two machines:
> > IMPETUS-1466 (172.26.49.156) and IMPETUS-1325 (172.26.49.55). Both have
> > Spark 1.6.0 pre-built for Hadoop 2.6 and later, but for both, I had
> existing
> > Hadoop clusters running Hadoop 1.0.4. I have launched HDFS from
> > 172.26.49.156 by running start-dfs.sh from it, copied files from local
> file
> > system to HDFS and can view them by hadoop fs -ls.
> >
> >   However, when I am trying to load the CSV file from pyspark shell
> > (launched by bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0)
> > from IMPETUS-1325 (172.26.49.55) with the following commands:
> >
> >
> >>>from pyspark.sql import SQLContext
> >
> >>>sqlContext = SQLContext(sc)
> >
> >>>patients_df =
> >>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>> "false").load("hdfs://
> 172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv
> ")
> >
> >
> > I get the following error:
> >
> >
> > java.io.EOFException: End of File Exception between local host is:
> > "IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55"; destination host is:
> > "IMPETUS-1466":54310; : java.io.EOFException; For more details see:
> > http://wiki.apache.org/hadoop/EOFException
> >
> >
> > I have changed the port number from 54310 to 8020, but then I get the
> error
> >
> >
> > java.net.ConnectException: Call From
> IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55
> > to IMPETUS-1466:8020 failed on connection exception:
> > java.net.ConnectException: Connection refused; For more details see:
> > http://wiki.apache.org/hadoop/ConnectionRefused
> >
> >
> > To me it seemed like this may result from a version mismatch between
> Spark
> > Hadoop client and my Hadoop cluster, so I have made the following
> changes:
> >
> >
> > 1) Added the following lines to conf/spark-env.sh
> >
> >
> > export HADOOP_HOME="/usr/local/hadoop-1.0.4" export
> > HADOOP_CONF_DIR="$HADOOP_HOME/conf" export
> > HDFS_URL="hdfs://172.26.49.156:8020"
> >
> >
> > 2) Downloaded Spark 1.6.0, pre-built with user-provided Hadoop, and in
> > addition to the three lines above, added the following line to
> > conf/spark-env.sh
> >
> >
> > export SPARK_DIST_CLASSPATH="/usr/local/hadoop-1.0.4/bin/hadoop"
> >
> >
> > but none of it seems to work. However, the following command works from
> > 172.26.49.55 and gives the directory listing:
> >
> > /usr/local/hadoop-1.0.4/bin/hadoop fs -ls hdfs://172.26.49.156:54310/
> >
> >
> > Any suggestion?
> >
> >
> > Thanks
> >
> > Bibudh
> >
> >
> > --
> > Bibudh Lahiri
> > Data Scientist, Impetus Technolgoies
> > 5300 Stevens Creek Blvd
> > San Jose, CA 95129
> > http://knowthynumbers.blogspot.com/
> >
>
> 

Re: what should I do when spark ut hang?

2016-04-27 Thread Ted Yu
Did you have a chance to take jstack when VersionsSuite was running ?

You can use the following command to run the test:

sbt/sbt "test-only org.apache.spark.sql.hive.client.VersionsSuite"

On Wed, Apr 27, 2016 at 9:01 PM, Demon King  wrote:

> Hi, all:
>I compile spark-1.6.1  in redhat 5.7(I have installed
> spark-1.6.0-cdh5.7.0 hive-1.1.0+cdh5.7.0 and hadoop-2.6.0+cdh5.7.0 in this
> machine). my compile cmd is:
>
> build/mvn --force -Psparkr -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver
>
>  when I use this cmd to run unit test. test hanged for a whole
> night(nearly 10 hour):
>
>  build/mvn --force -Psparkr -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver  test
>
>  That is the last log:
>
> Discovery completed in 25 seconds, 803 milliseconds.
> Run starting. Expected test count is: 1716
> StatisticsSuite:
> - parse analyze commands
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> log4j:WARN No appenders could be found for logger (hive.log).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> - analyze MetastoreRelations
> - estimates the size of a test MetastoreRelation
> - auto converts to broadcast hash join, by size estimate of a relation
> - auto converts to broadcast left semi join, by size estimate of a relation
> VersionsSuite:
>
>
>  I
> modify 
> ./sql/hive/target/scala-2.10/test-classes/org/apache/spark/sql/hive/client/VersionsSuite.class
> and add print statement in each beginning of function, but no message
> print. Now what should I do to solve this problem?
>
>  Thank you!
>
>


insert into a partition table take a long time

2016-04-27 Thread 谭成灶

 Hello Sir/Madam
   I want to insert into a partition table using dynamic partitioning (about 300G;
the destination table is created in ORC format),
   but the stage "get_partition_with_auth" takes a long time,
   even though I have set:

hive.exec.dynamic.partition=true 

hive.exec.dynamic.partition.mode="nonstrict"
   
   The following is my environment:
   hadoop2.5.0CDH5.2.1
   hive 0.13.1
   spark-1.6.1-bin-2.5.0-cdh5.2.1 (I have recompiled it, but hive.version=1.2.1)
   
   I found an issue: https://issues.apache.org/jira/browse/SPARK-11785
  When deployed against remote Hive metastore, execution Hive client points 
to the actual Hive metastore rather than local execution Derby metastore using 
Hive 1.2.1 libraries delivered together with Spark (SPARK-11783).
JDBC calls are not properly dispatched to metastore Hive client in Thrift 
server, but handled by execution Hive. (SPARK-9686).
When a JDBC call like getSchemas() comes, execution Hive client using a 
higher version (1.2.1) is used to talk to a lower version Hive metastore 
(0.13.1). Because of incompatible changes made between these two versions, the 
Thrift RPC call fails and exceptions are thrown.
  
   When I run bin/spark-sql, here is the info:
   16/04/28 11:08:59 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
underlying DB is DERBY
   16/04/28 11:08:59 INFO metastore.ObjectStore: Initialized ObjectStore
   16/04/28 11:08:59 WARN metastore.ObjectStore: Version information not found 
in metastore. hive.metastore.schema.verification is not enabled so recording 
the schema version 1.2.0
   16/04/28 11:08:59 WARN metastore.ObjectStore: Failed to get database 
default, returning NoSuchObjectException
   16/04/28 11:08:59 INFO metastore.HiveMetaStore: Added admin role in metastore
   16/04/28 11:08:59 INFO metastore.HiveMetaStore: Added public role in 
metastore
   16/04/28 11:09:00 INFO metastore.HiveMetaStore: No user is added in admin 
role, since config is empty
   16/04/28 11:09:00 INFO metastore.HiveMetaStore: 0: get_all_databases
   16/04/28 11:09:00 INFO HiveMetaStore.audit: ugi=ocdcip=unknown-ip-addr   
   cmd=get_all_databases
   16/04/28 11:09:00 INFO metastore.HiveMetaStore: 0: get_functions: db=default 
pat=*
   16/04/28 11:09:00 INFO HiveMetaStore.audit: ugi=ocdcip=unknown-ip-addr   
   cmd=get_functions: db=default pat=*
   
   
So can you suggest any optimized way, or do I have to upgrade my
Hadoop and Hive versions?

 Thanks
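
For reference, a sketch of how those settings and the insert usually look from the Spark SQL side (table and column names are placeholders; this does not address the slow get_partition_with_auth metastore call itself):

// sqlContext here is a HiveContext / SQLContext with Hive support
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

sqlContext.sql(
  """INSERT INTO TABLE target_orc_table PARTITION (dt)
    |SELECT col1, col2, dt FROM source_table
  """.stripMargin)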

Spark executor crashes when the tasks are cancelled

2016-04-27 Thread Kiran Chitturi
Hi,

We are seeing this issue with Spark 1.6.1. The executor is exiting when one
of the running tasks is cancelled.

The executor log is showing the below error before crashing.

16/04/27 16:34:13 ERROR SparkUncaughtExceptionHandler: [Container in
> shutdown] Uncaught exception in thread Thread[Executor task launch
> worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
> at
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> at
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
> at
> org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
> at
> org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
> at
> org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
> at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
> at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:252)
> at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:195)
> at
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:132)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:288)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more


I have attached the full logs at this gist:
https://gist.github.com/kiranchitturi/3bd3a083a7c956cff73040c1a140c88f

On the driver side, the following info is logged (
https://gist.github.com/kiranchitturi/3bd3a083a7c956cff73040c1a140c88f)

The following lines show that Executor exited because of the running tasks

2016-04-27T16:34:13,723 - WARN [dispatcher-event-loop-1:Logging$class@70] -
> Lost task 0.0 in stage 89.0 (TID 173, 10.0.0.42): ExecutorLostFailure
> (executor 2 exited caused by one of the running tasks) Reason: Remote RPC
> client di
> sassociated. Likely due to containers exceeding thresholds, or network
> issues. Check driver logs for WARN messages.
> 2016-04-27T16:34:13,723 - WARN [dispatcher-event-loop-1:Logging$class@70]
> - Lost task 1.0 in stage 89.0 (TID 174, 10.0.0.42): ExecutorLostFailure
> (executor 2 exited caused by one of the running tasks) Reason: Remote RPC
> client di


Is it possible for an executor to die when the jobs in the SparkContext are
cancelled? Apart from https://issues.apache.org/jira/browse/SPARK-14234, I
could not find any JIRAs that report this error.

Sometimes we notice a scenario where the executor dies and the driver doesn't
request a new one. This causes the jobs to hang indefinitely. We are using
dynamic allocation for our jobs.

Thanks,


>
Kiran Chitturi


Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel

Doh! 
Wrong email account again! 

> Begin forwarded message:
> 
> From: Michael Segel 
> Subject: Re: Spark support for Complex Event Processing (CEP)
> Date: April 27, 2016 at 7:16:55 PM CDT
> To: Mich Talebzadeh 
> Cc: Esa Heikkinen , "user@spark" 
> 
> 
> Uhm… 
> I think you need to clarify a couple of things…
> 
> First there is this thing called analog signal processing…. Is that 
> continuous enough for you? 
> 
> But more to the point, Spark Streaming does micro batching so if you’re 
> processing a continuous stream of tick data, you will have more than 50K 
> ticks per second while the markets are open and trading.  Even at 50K a 
> second, that would mean 1 every .02 ms or 50 ticks a ms. 
> 
> And you don’t want to wait until you have a batch to start processing, but 
> you want to process when the data hits the queue and pull it from the queue 
> as quickly as possible. 
> 
> Spark streaming will be able to pull batches in as little as 500ms. So if you 
> pull a batch at t0 and immediately have a tick in your queue, you won’t 
> process that data until t0+500ms. And said batch would contain 25,000 
> entries. 
> 
> Depending on what you are doing… that 500ms delay can be enough to be fatal 
> to your trading process. 
> 
> If you don’t like stock data, there are other examples mainly when pulling 
> data from real time embedded systems. 
> 
> 
> If you go back and read what I said, if your data flow is >> (much slower) 
> than 500ms, and / or the time to process is >> 500ms ( much longer )  you 
> could use spark streaming.  If not… and there are applications which require 
> that type of speed…  then you shouldn’t use spark streaming. 
> 
> If you do have that constraint, then you can look at systems like 
> storm/flink/samza / whatever where you have a continuous queue and listener 
> and no micro batch delays.
> Then for each bolt (storm) you can have a spark context for processing the 
> data. (Depending on what sort of processing you want to do.) 
> 
> To put this in perspective… if you’re using spark streaming / akka / storm 
> /etc to handle real time requests from the web, 500ms added delay can be a 
> long time. 
> 
> Choose the right tool. 
> 
> For the OP’s problem. Sure Tracking public transportation could be done using 
> spark streaming. It could also be done using half a dozen other tools because 
> the rate of data generation is much slower than 500ms. 
> 
> HTH
> 
> 
>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh > > wrote:
>> 
>> couple of things.
>> 
>> There is no such thing as Continuous Data Streaming as there is no such 
>> thing as Continuous Availability.
>> 
>> There is such thing as Discrete Data Streaming and  High Availability  but 
>> they reduce the finite unavailability to minimum. In terms of business needs 
>> a 5 SIGMA is good enough and acceptable. Even the candles set to a 
>> predefined time interval say 2, 4, 15 seconds overlap. No FX savvy trader 
>> makes a sell or buy decision on the basis of 2 seconds candlestick
>> 
>> The calculation itself in measurements is subject to finite error as defined 
>> by their Confidence Level (CL) using Standard Deviation function.
>> 
>> OK so far I have never noticed a tool that requires that details of 
>> granularity. Those stuff from Flink etc is in practical term is of little 
>> value and does not make commercial sense.
>> 
>> Now with regard to your needs, Spark micro batching is perfectly adequate.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 27 April 2016 at 22:10, Esa Heikkinen > > wrote:
>> 
>> Hi
>> 
>> Thanks for the answer.
>> 
>> I have developed a log file analyzer for RTPIS (Real Time Passenger 
>> Information System) system, where buses drive lines and the system try to 
>> estimate the arrival times to the bus stops. There are many different log 
>> files (and events) and analyzing situation can be very complex. Also spatial 
>> data can be included to the log data.
>> 
>> The analyzer also has a query (or analyzing) language, which describes a 
>> expected behavior. This can be a requirement of system. Analyzer can be 
>> think to be also a test oracle.
>> 
>> I have published a paper (SPLST'15 conference) about my analyzer and its 
>> language. The paper is maybe too technical, but it is found:
>> http://ceur-ws.org/Vol-1525/paper-19.pdf 
>> 
>> 
>> I do not know yet where it belongs. I think it can be some "CEP with 
>> delays". Or do 

Re: Compute

2016-04-27 Thread Karl Higley
One idea is to avoid materializing the pairs of points before computing the
distances between them. You could do that using the LSH signatures by
building (Signature, (Int, Vector)) tuples, grouping by signature, and then
iterating pairwise over the resulting lists of points to compute the
distances between them. The points still have to be shuffled over the
network, but at least the shuffle doesn't create multiple copies of each
point (like a join by point ids would).

Here's an implementation of that idea in the context of finding nearest
neighbors:
https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34

Best,
Karl



On Wed, Apr 27, 2016 at 10:22 PM nguyen duc tuan 
wrote:

> Hi all,
> Currently, I'm working on implementing LSH on spark. The problem leads to
> follow problem. I have an RDD[(Int, Int)] stores all pairs of ids of
> vectors need to compute distance and an other RDD[(Int, Vector)] stores all
> vectors with their ids. Can anyone  suggest an efficiency way to compute
> distance? My simple version that I try first is as follows but it's
> inefficient because it require a lot of shuffling data over the network.
>
> rdd1: RDD[(Int, Int)] = ..
> rdd2: RDD[(Int, Vector)] = ...
> val distances = rdd2.cartesian(rdd2)
>   .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
>   .join(rdd1.map(x => (x, 1))
>   .mapValues(x => {
>  measure.compute(x._1._1, x._1._2)
>   })
>
> Thanks for any suggestion.
>
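
A condensed sketch of the grouping idea above, assuming the LSH step has already produced an RDD of (signature, (id, vector)) records (the names and the use of squared Euclidean distance are illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def bucketDistances(signatures: RDD[(String, (Int, Vector))]): RDD[((Int, Int), Double)] =
  signatures
    .groupByKey()
    .flatMap { case (_, points) =>
      // pairwise distances inside each bucket; no join on (id, id) pairs needed
      points.toSeq.combinations(2).map { case Seq((id1, v1), (id2, v2)) =>
        val key = if (id1 < id2) (id1, id2) else (id2, id1)  // canonical pair ordering
        (key, Vectors.sqdist(v1, v2))
      }
    }
    .reduceByKey((d, _) => d)  // keep one copy for pairs that collide in several buckets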


Re: Compute

2016-04-27 Thread nguyen duc tuan
I have seen this implementation before. The problem here is that if, after
several hashes, a pair of points appears K times in the buckets (with
respect to K hashes), the distance needs to be computed K times, and in total
the data needs to be shuffled up to K times. So it reduces to my problem.
I'm trying a new approach and I think it will be better than my original
approach:
val part1 = rdd1.map(x => (x._1, x)).join(rdd2).map(_._2)
val part2 = rdd1.map(x => (x._2, x)).join(rdd2).map(_._2)
val distances = part1.join(part2).mapValues(v => measure.compute(v._1,
v._2))

And I'm sorry for the ugly title of the email. I forgot to check it before sending.

2016-04-28 10:10 GMT+07:00 Karl Higley :

> One idea is to avoid materializing the pairs of points before computing
> the distances between them. You could do that using the LSH signatures by
> building (Signature, (Int, Vector)) tuples, grouping by signature, and then
> iterating pairwise over the resulting lists of points to compute the
> distances between them. The points still have to be shuffled over the
> network, but at least the shuffle doesn't create multiple copies of each
> point (like a join by point ids would).
>
> Here's an implementation of that idea in the context of finding nearest
> neighbors:
>
> https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34
>
> Best,
> Karl
>
>
>
> On Wed, Apr 27, 2016 at 10:22 PM nguyen duc tuan 
> wrote:
>
>> Hi all,
>> Currently, I'm working on implementing LSH on spark. The problem leads to
>> follow problem. I have an RDD[(Int, Int)] stores all pairs of ids of
>> vectors need to compute distance and an other RDD[(Int, Vector)] stores all
>> vectors with their ids. Can anyone  suggest an efficiency way to compute
>> distance? My simple version that I try first is as follows but it's
>> inefficient because it require a lot of shuffling data over the network.
>>
>> rdd1: RDD[(Int, Int)] = ..
>> rdd2: RDD[(Int, Vector)] = ...
>> val distances = rdd2.cartesian(rdd2)
>>   .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
>>   .join(rdd1.map(x => (x, 1))
>>   .mapValues(x => {
>>  measure.compute(x._1._1, x._1._2)
>>   })
>>
>> Thanks for any suggestion.
>>
>


Re: Cant join same dataframe twice ?

2016-04-27 Thread Divya Gehlot
When working with DataFrames and using explain to debug, I observed that
Spark gives a different tag number to the same dataframe columns.
Like in this case:
val df1 = df2.join(df3,"Column1")
The line below throws the missing-columns error:
val df4 = df1.join(df3,"Column2")

For instance, df2 has 2 columns, which get tagged like df2Col1#4, df2Col2#5;
df3 has 4 columns, which get tagged like
df3Col1#6, df3Col2#7, df3Col3#8, df3Col4#9.
After the first join, df1's columns are tagged
df2Col1#10, df2Col2#11, df3Col1#12, df3Col2#13, df3Col3#14, df3Col4#15.

Now when df1 is joined again with df3, the df3 column tags change:
df2Col1#16, df2Col2#17, df3Col1#18,
df3Col2#19, df3Col3#20, df3Col4#21, df3Col2#23, df3Col3#24, df3Col4#25

but a join on df3Col1#12 would be referring to the previous dataframe, and
that causes the issue.

Thanks,
Divya






On 27 April 2016 at 23:55, Ted Yu  wrote:

> I wonder if Spark can provide better support for this case.
>
> The following schema is not user friendly (shown previsouly):
>
> StructField(b,IntegerType,false), StructField(b,IntegerType,false)
>
> Except for 'select *', there is no way for user to query any of the two
> fields.
>
> On Tue, Apr 26, 2016 at 10:17 PM, Takeshi Yamamuro 
> wrote:
>
>> Based on my example, how about renaming columns?
>>
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"),
>> df2("b").as("2-b"))
>> val df4 = df3.join(df2, df3("2-b") === df2("b"))
>>
>> // maropu
>>
>> On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot 
>> wrote:
>>
>>> Correct Takeshi
>>> Even I am facing the same issue .
>>>
>>> How to avoid the ambiguity ?
>>>
>>>
>>> On 27 April 2016 at 11:54, Takeshi Yamamuro 
>>> wrote:
>>>
 Hi,

 I tried;
 val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
 val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
 val df3 = df1.join(df2, "a")
 val df4 = df3.join(df2, "b")

 And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
 ambiguous, could be: b#6, b#14.;
 If same case, this message makes sense and this is clear.

 Thought?

 // maropu







 On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla 
 wrote:

> Also, check the column names of df1 ( after joining df2 and df3 ).
>
> Prasad.
>
> From: Ted Yu
> Date: Monday, April 25, 2016 at 8:35 PM
> To: Divya Gehlot
> Cc: "user @spark"
> Subject: Re: Cant join same dataframe twice ?
>
> Can you show us the structure of df2 and df3 ?
>
> Thanks
>
> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot  > wrote:
>
>> Hi,
>> I am using Spark 1.5.2 .
>> I have a use case where I need to join the same dataframe twice on
>> two different columns.
>> I am getting error missing Columns
>>
>> For instance ,
>> val df1 = df2.join(df3,"Column1")
>> Below throwing error missing columns
>> val df 4 = df1.join(df3,"Column2")
>>
>> Is the bug or valid scenario ?
>>
>>
>>
>>
>> Thanks,
>> Divya
>>
>
>


 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Re: Compute

2016-04-27 Thread Karl Higley
You're right that there are some duplicate distance computations happening in
the implementation I mentioned. I ran into the kinds of issues you
describe, and I ended up accepting the duplicate computational work in
exchange for significantly reduced memory usage. I couldn't figure out a
way to avoid materializing the pairs (to save shuffle size and memory)
while also avoiding duplicate distance computations. I'd be very curious to
know if there is one, though!



On Wed, Apr 27, 2016, 11:22 PM nguyen duc tuan  wrote:

> I see this implementation before. The problem here is that: If after
> several hashes, if a pair of points appears K times in a bucket (with
> respect to K hashes), the distance needs to be computed K times, and total
> the data needs to shuffled will upto K times. So it deduce to my problem.
> I'm trying new approach and I think It will be better than my original
> approach:
> val part1 = rdd1.map(x => (x._1, x)).join(rdd2).map(_._2)
> val part2 = rdd2.map(x => (x._2, x)).join(rdd2).map(_._2)
> val distances = part1.join(part2).mapValues(v => measure.compute(v._1,
> v._2))
>
> And I'm sorry for uggly title of email. I forgot to check it before send.
>
> 2016-04-28 10:10 GMT+07:00 Karl Higley :
>
>> One idea is to avoid materializing the pairs of points before computing
>> the distances between them. You could do that using the LSH signatures by
>> building (Signature, (Int, Vector)) tuples, grouping by signature, and then
>> iterating pairwise over the resulting lists of points to compute the
>> distances between them. The points still have to be shuffled over the
>> network, but at least the shuffle doesn't create multiple copies of each
>> point (like a join by point ids would).
>>
>> Here's an implementation of that idea in the context of finding nearest
>> neighbors:
>>
>> https://github.com/karlhigley/spark-neighbors/blob/master/src/main/scala/com/github/karlhigley/spark/neighbors/ANNModel.scala#L33-L34
>>
>> Best,
>> Karl
>>
>>
>>
>> On Wed, Apr 27, 2016 at 10:22 PM nguyen duc tuan 
>> wrote:
>>
>>> Hi all,
>>> Currently, I'm working on implementing LSH on spark. The problem leads
>>> to follow problem. I have an RDD[(Int, Int)] stores all pairs of ids of
>>> vectors need to compute distance and an other RDD[(Int, Vector)] stores all
>>> vectors with their ids. Can anyone  suggest an efficiency way to compute
>>> distance? My simple version that I try first is as follows but it's
>>> inefficient because it require a lot of shuffling data over the network.
>>>
>>> rdd1: RDD[(Int, Int)] = ..
>>> rdd2: RDD[(Int, Vector)] = ...
>>> val distances = rdd2.cartesian(rdd2)
>>>   .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
>>>   .join(rdd1.map(x => (x, 1))
>>>   .mapValues(x => {
>>>  measure.compute(x._1._1, x._1._2)
>>>   })
>>>
>>> Thanks for any suggestion.
>>>
>>
>


what should I do when spark ut hang?

2016-04-27 Thread Demon King
Hi, all:
   I compiled spark-1.6.1 on Red Hat 5.7 (I have installed
spark-1.6.0-cdh5.7.0, hive-1.1.0+cdh5.7.0 and hadoop-2.6.0+cdh5.7.0 on this
machine). My compile command is:

build/mvn --force -Psparkr -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver

 When I use this command to run the unit tests, the tests hang for a whole
night (nearly 10 hours):

 build/mvn --force -Psparkr -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver  test

 This is the last log output:

Discovery completed in 25 seconds, 803 milliseconds.
Run starting. Expected test count is: 1716
StatisticsSuite:
- parse analyze commands
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
log4j:WARN No appenders could be found for logger (hive.log).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
Warning: fs.defaultFS is not set when running "chgrp" command.
Warning: fs.defaultFS is not set when running "chmod" command.
- analyze MetastoreRelations
- estimates the size of a test MetastoreRelation
- auto converts to broadcast hash join, by size estimate of a relation
- auto converts to broadcast left semi join, by size estimate of a relation
VersionsSuite:


 I modified
./sql/hive/target/scala-2.10/test-classes/org/apache/spark/sql/hive/client/VersionsSuite.class
and added a print statement at the beginning of each function, but no messages
are printed. Now what should I do to solve this problem?

 Thank you!


Compute

2016-04-27 Thread nguyen duc tuan
Hi all,
Currently, I'm working on implementing LSH on Spark. It leads to the
following problem: I have an RDD[(Int, Int)] that stores all pairs of ids of
vectors whose distance needs to be computed, and another RDD[(Int, Vector)]
that stores all vectors with their ids. Can anyone suggest an efficient way to
compute the distances? The simple version that I tried first is below, but it's
inefficient because it requires a lot of shuffling of data over the network.

rdd1: RDD[(Int, Int)] = ..
rdd2: RDD[(Int, Vector)] = ...
val distances = rdd2.cartesian(rdd2)
  .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))
  .join(rdd1.map(x => (x, 1)))
  .mapValues(x => {
     measure.compute(x._1._1, x._1._2)
  })

Thanks for any suggestion.


AuthorizationException while exposing via JDBC client (beeline)

2016-04-27 Thread ram kumar
Hi,

I wrote a spark job which registers a temp table
and when I expose it via beeline (JDBC client)

$ ./bin/beeline
beeline> !connect jdbc:hive2://IP:10003 -n ram -p
0: jdbc:hive2://IP> show tables;
+------------+--------------+
| tableName  | isTemporary  |
+------------+--------------+
| f238       | true         |
+------------+--------------+
2 rows selected (0.309 seconds)
0: jdbc:hive2://IP>

I can view the table. When querying I get this error message

0: jdbc:hive2://IP> select * from f238;
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
User: ram is not allowed to impersonate ram (state=,code=0)
0: jdbc:hive2://IP>

I have this in hive-site.xml:

<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>false</value>
  <description>If true, the metastore Thrift interface will be secured
  with SASL. Clients must authenticate with Kerberos.</description>
</property>

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>


I have this in core-site.xml:

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>



When I persist it as a table using saveAsTable, I am able to query it via
beeline.
Any idea what configuration I am missing?

Thanks


Re: Save DataFrame to HBase

2016-04-27 Thread Benjamin Kim
Daniel,

If you can get the code snippet, that would be great! I've been trying to get
it to work as well, and the examples on the Phoenix website do not work for
me. If you are willing, can you also include your setup for making Phoenix
work with Spark?
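
For reference, this is the shape of what I have been attempting, based on my
reading of the phoenix-spark documentation (a sketch only: the table name and
ZooKeeper quorum are placeholders, and it assumes the phoenix-spark artifact
is on both the driver and executor classpath):

import org.apache.spark.sql.SaveMode

// write an existing DataFrame `df` into a pre-created Phoenix table
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)          // overwrite is the mode the Phoenix source documents
  .option("table", "OUTPUT_TABLE")   // placeholder: Phoenix table name
  .option("zkUrl", "zkhost:2181")    // placeholder: HBase ZooKeeper quorum
  .save()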

Thanks,
Ben

> On Apr 27, 2016, at 11:46 AM, Paras sachdeva  
> wrote:
> 
> Hi Daniel,
> 
> Would you possibly be able to share the snipped to code you have used ?
> 
> Thank you.
> 
> On Wed, Apr 27, 2016 at 3:13 PM, Daniel Haviv 
> > 
> wrote:
> Hi Benjamin,
> Yes it should work.
> 
> Let me know if you need further assistance I might be able to get the code 
> I've used for that project.
> 
> Thank you.
> Daniel
> 
> On 24 Apr 2016, at 17:35, Benjamin Kim  > wrote:
> 
>> Hi Daniel,
>> 
>> How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed which 
>> comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
>> 
>> Thanks,
>> Ben
>> 
>>> On Apr 24, 2016, at 1:43 AM, Daniel Haviv >> > wrote:
>>> 
>>> Hi,
>>> I tried saving DF to HBase using a hive table with hbase storage handler 
>>> and hiveContext but it failed due to a bug.
>>> 
>>> I was able to persist the DF to hbase using Apache Pheonix which was pretty 
>>> simple.
>>> 
>>> Thank you.
>>> Daniel
>>> 
>>> On 21 Apr 2016, at 16:52, Benjamin Kim >> > wrote:
>>> 
 Has anyone found an easy way to save a DataFrame into HBase?
 
 Thanks,
 Ben
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 
 For additional commands, e-mail: user-h...@spark.apache.org 
 
 
>> 
> 



spark-ts

2016-04-27 Thread Bhupendra Mishra
Guys, please help me with the following question on the spark-ts library.

You’ve just acquired a new dataset showing the purchases of stock from
market resellers during the day over a ten month period. You’ve looked at
the daily data and have decided that you can model this using a time series
analysis. You use the library spark-ts to populate a time series RDD. How
can you model the data using the library? What does the d model?
- Use the ARIMA method and calculate p, d and q; and stationarity
- Use the ARIMA method and calculate p, d and q; and lag
- Use the ARIMA method and calculate auto regression; and moving averages
- Use the ARMA method and calculate p, d and q; and lag


Re: Save DataFrame to HBase

2016-04-27 Thread Benjamin Kim
Hi Ted,

Do you know when the release will be? I also see some documentation for usage
of the hbase-spark module on the HBase website, but I don't see an example of
how to save data; there is only one for reading/querying data. Will this be
added when the final version gets released?

Thanks,
Ben

> On Apr 21, 2016, at 6:56 AM, Ted Yu  wrote:
> 
> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do 
> this.
> 
> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim  > wrote:
> Has anyone found an easy way to save a DataFrame into HBase?
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



getting ClassCastException when calling UDF

2016-04-27 Thread Divya Gehlot
Hi,
I am using Spark 1.5.2 and defined the UDF below:

import org.apache.spark.sql.functions.udf
> val myUdf  = (wgts : Int , amnt :Float) => {
> (wgts*amnt)/100.asInstanceOf[Float]
> }
>



val df2 = df1.withColumn("WEIGHTED_AMOUNT",callUDF(udfcalWghts,
FloatType,col("RATE"),col("AMOUNT")))

In my schema, RATE is IntegerType and AMOUNT is FloatType.

I am getting the error below:

> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 106 in stage 89.0 failed 4 times, most recent failure: Lost task 106.3 in
> stage 89.0 (TID 7735, ip-xx-xx-xx-xxx.ap-southeast-1.compute.internal):
> java.lang.ClassCastException: java.lang.Double cannot be cast to
> java.lang.Float
> at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(https://github.com/EclairJS/eclairjs-nashorn/issues/3


Can somebody help me with the resolution ?
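
For anyone searching later, a minimal sketch of an alternative formulation
that sidesteps the Double-to-Float cast (only an illustration; the column
names are taken from the snippet above, and the result comes back as
DoubleType rather than FloatType):

import org.apache.spark.sql.functions.{col, udf}

// widen the amount to Double so the boxed value from the row does not
// have to be cast down to Float inside the UDF
val calWghtsUdf = udf((wgts: Int, amnt: Double) => (wgts * amnt) / 100.0)

val df2 = df1.withColumn("WEIGHTED_AMOUNT", calWghtsUdf(col("RATE"), col("AMOUNT")))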











Thanks,
Divya


DataFrame to DataSet without Predefined Class

2016-04-27 Thread Brandon White
I am reading parquet files into a dataframe. The schema varies depending on
the data so I have no way to write a predefined class.

Is there any way to go from DataFrame to DataSet without a predefined case
class? Can I build a class from my dataframe schema?
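
One partial workaround (a sketch only) is the tuple encoder in the 1.6
Dataset API: it avoids a named case class but still needs the column types at
compile time, so it does not really cover the varying-schema case (the path
and types below are placeholders):

import sqlContext.implicits._   // provides encoders for primitives and tuples

// assumes the files happen to have exactly these two columns, in this order
val df = sqlContext.read.parquet("/path/to/parquet")
val ds = df.as[(String, Long)]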


RE: removing header from csv file

2016-04-27 Thread Mishra, Abhishek
You should be doing something like this:


data = sc.textFile('file:///path1/path/test1.csv')
header = data.first() #extract header
#print header
data = data.filter(lambda x:x !=header)
#print data
Hope it helps.

Sincerely,
Abhishek
+91-7259028700

From: nihed mbarek [mailto:nihe...@gmail.com]
Sent: Wednesday, April 27, 2016 11:29 AM
To: Divya Gehlot
Cc: Ashutosh Kumar; user @spark
Subject: Re: removing header from csv file

You can add a filter with string that you are sure available only in the header

On Wednesday, 27 April 2016, Divya Gehlot
> wrote:
yes you can remove the headers by removing the first row

can first() or head() to do that


Thanks,
Divya

On 27 April 2016 at 13:24, Ashutosh Kumar 
>
 wrote:
I see there is a library spark-csv which can be used for removing header and 
processing of csv files. But it seems it works with sqlcontext only. Is there a 
way to remove header from csv files without sqlcontext ?
Thanks
Ashutosh



--

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: removing header from csv file

2016-04-27 Thread Hyukjin Kwon
There are two ways to do so.


Firstly, this way makes sure the header is skipped cleanly, although the use
of mapPartitionsWithIndex costs some performance:

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else
iter }


Secondly, you can do:

val header = rdd.first()
val data = rdd.filter(_ != header)

The second method does not guarantee that only the header is skipped, because
there might be records identical to the header.


The CSV data source uses the second way, so I left a TODO in the PR I recently
opened.



2016-04-27 14:59 GMT+09:00 nihed mbarek :

> You can add a filter with string that you are sure available only in the
> header
>
>
> On Wednesday, 27 April 2016, Divya Gehlot
> wrote:
>
>> yes you can remove the headers by removing the first row
>>
>> can first() or head() to do that
>>
>>
>> Thanks,
>> Divya
>>
>> On 27 April 2016 at 13:24, Ashutosh Kumar 
>> wrote:
>>
>>> I see there is a library spark-csv which can be used for removing header
>>> and processing of csv files. But it seems it works with sqlcontext only. Is
>>> there a way to remove header from csv files without sqlcontext ?
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>
>


Re: removing header from csv file

2016-04-27 Thread Marco Mistroni
If you are using the Scala API you can do:
myRdd.zipWithIndex.filter(_._2 > 0).map(_._1)

Maybe a little bit complicated, but it will do the trick.
As for spark-csv, you will get back a DataFrame, which you can convert back to
an RDD.
Hth
Marco
On 27 Apr 2016 6:59 am, "nihed mbarek"  wrote:

> You can add a filter with string that you are sure available only in the
> header
>
> On Wednesday, 27 April 2016, Divya Gehlot
> wrote:
>
>> yes you can remove the headers by removing the first row
>>
>> can first() or head() to do that
>>
>>
>> Thanks,
>> Divya
>>
>> On 27 April 2016 at 13:24, Ashutosh Kumar 
>> wrote:
>>
>>> I see there is a library spark-csv which can be used for removing header
>>> and processing of csv files. But it seems it works with sqlcontext only. Is
>>> there a way to remove header from csv files without sqlcontext ?
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>
>


Re: removing header from csv file

2016-04-27 Thread Nachiketa
Why "without sqlcontext"  ? Could you please describe what is it that you
are trying to accomplish ? Thanks.

Regards,
Nachiketa

On Wed, Apr 27, 2016 at 10:54 AM, Ashutosh Kumar 
wrote:

> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it seems it works with sqlcontext only. Is
> there a way to remove header from csv files without sqlcontext ?
>
> Thanks
> Ashutosh
>



-- 
Regards,
-- Nachiketa


Re: error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread Sachin Aggarwal
For me it works without making any other change. Try importing:
import sqlContext.implicits._
Otherwise, verify whether you are able to run other functions, or whether you
have some issue with your setup.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/27 16:07:23 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
16/04/27 16:07:23 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
16/04/27 16:07:29 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 1.2.0
16/04/27 16:07:29 WARN ObjectStore: Failed to get database default,
returning NoSuchObjectException
16/04/27 16:07:29 WARN : Your hostname, sachins-MacBook-Pro-2.local
resolves to a loopback/non-reachable address:
fe80:0:0:0:288b:f5ff:fe80:367e%awdl0, but we couldn't find any external IP
address!
16/04/27 16:07:36 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
16/04/27 16:07:36 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala>

scala> val ds = Seq(1, 2, 3).toDS()
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.map(_ + 1).collect() // Returns: Array(2, 3, 4)
res0: Array[Int] = Array(2, 3, 4)

On Wed, Apr 27, 2016 at 4:01 PM, shengshanzhang 
wrote:

> 1.6.1
>
> On 27 April 2016, at 18:28, Sachin Aggarwal wrote:
>
> what is ur spark version?
>
> On Wed, Apr 27, 2016 at 3:12 PM, shengshanzhang  > wrote:
>
>> Hi,
>>
>> On spark website, there is code as follows showing how to create datasets.
>> 
>>
>> However when i input this line into spark-shell,there comes a Error, and
>> who can tell me Why and how to fix this?
>>
>> scala> val ds = Seq(1, 2, 3).toDS()
>> :35: error: value toDS is not a member of Seq[Int]
>>val ds = Seq(1, 2, 3).toDS()
>>
>>
>>
>> Thank you a lot!
>>
>
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
>
>
>


-- 

Thanks & Regards

Sachin Aggarwal
7760502772


Re: Spark 1.6.1 throws error: Did not find registered driver with class oracle.jdbc.OracleDriver

2016-04-27 Thread Jeff Zhang
Could you check the log of executor to find the full stack trace ?
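
In case it helps while you dig out the executor log: a sketch of naming the
driver class explicitly through the JDBC data source's "driver" option (the
Oracle jar itself still has to reach the executors, for example via --jars or
spark.executor.extraClassPath):

val s = HiveContext.load("jdbc",
  Map("url"      -> _ORACLEserver,
      "dbtable"  -> "(SELECT ... FROM scratchpad.dummy)",  // abbreviated from the query above
      "driver"   -> "oracle.jdbc.OracleDriver",            // make the driver class explicit
      "user"     -> _username,
      "password" -> _password))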

On Tue, Apr 26, 2016 at 12:30 AM, Mich Talebzadeh  wrote:

> Hi,
>
> This JDBC connection was working fine in Spark 1.5,2
>
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val sqlContext = new HiveContext(sc)
> println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> //
> var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb"
> var _username : String = "scratchpad"
> var _password : String = "xxx"
> //
> val s = HiveContext.load("jdbc",
> Map("url" -> _ORACLEserver,
> "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
> to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
> "user" -> _username,
> "password" -> _password))
>
> s.toDF.registerTempTable("tmp")
>
>
> // Need to create and populate target ORC table sales in database test in
> Hive
> //
> HiveContext.sql("use test")
> //
> // Drop and create table
> //
> HiveContext.sql("DROP TABLE IF EXISTS test.dummy2")
> var sqltext : String = ""
> sqltext = """
> CREATE TABLE test.dummy2
>  (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="ID",
> "orc.bloom.filter.fpp"="0.05",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1" )
> """
> HiveContext.sql(sqltext)
> //
> sqltext = """
> INSERT INTO TABLE test.dummy2
> SELECT
> *
> FROM tmp
> """
> HiveContext.sql(sqltext)
>
> In Spark 1.6.1, it is throwing error as below
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 1.0 (TID 4, rhes564): java.lang.IllegalStateException: Did not find
> registered driver with class oracle.jdbc.OracleDriver
>
> Is this a new bug introduced in Spark 1.6.1?
>
>
> Thanks
>



-- 
Best Regards

Jeff Zhang


Re: error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread shengshanzhang
1.6.1
> On 27 April 2016, at 18:28, Sachin Aggarwal wrote:
> 
> what is ur spark version?
> 
> On Wed, Apr 27, 2016 at 3:12 PM, shengshanzhang  > wrote:
> Hi,
> 
>   On spark website, there is code as follows showing how to create 
> datasets.
>  
> 
>   However when i input this line into spark-shell,there comes a Error, 
> and who can tell me Why and how to fix this?
> 
> scala> val ds = Seq(1, 2, 3).toDS()
> :35: error: value toDS is not a member of Seq[Int]
>val ds = Seq(1, 2, 3).toDS()
> 
> 
> 
>   Thank you a lot!
> 
> 
> 
> -- 
> 
> Thanks & Regards
> 
> Sachin Aggarwal
> 7760502772



Re: error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread Ted Yu
Did you do the import as the first comment shows ?

> On Apr 27, 2016, at 2:42 AM, shengshanzhang  wrote:
> 
> Hi,
> 
>   On spark website, there is code as follows showing how to create 
> datasets.
>  
> 
>   However when i input this line into spark-shell,there comes a Error, 
> and who can tell me Why and how to fix this?
> 
> scala> val ds = Seq(1, 2, 3).toDS()
> :35: error: value toDS is not a member of Seq[Int]
>val ds = Seq(1, 2, 3).toDS()
> 
> 
> 
>   Thank you a lot!


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread Sachin Aggarwal
what is ur spark version?

On Wed, Apr 27, 2016 at 3:12 PM, shengshanzhang 
wrote:

> Hi,
>
> On spark website, there is code as follows showing how to create datasets.
>
>
> However when i input this line into spark-shell,there comes a Error, and
> who can tell me Why and how to fix this?
>
> scala> val ds = Seq(1, 2, 3).toDS()
> :35: error: value toDS is not a member of Seq[Int]
>val ds = Seq(1, 2, 3).toDS()
>
>
>
> Thank you a lot!
>



-- 

Thanks & Regards

Sachin Aggarwal
7760502772


Re: user Digest 27 Apr 2016 09:42:14 -0000 Issue 6581

2016-04-27 Thread Mich Talebzadeh
Hi Esa,

I am trying to use Spark Streaming for CEP and it is looking promising.

General (from my blog)

In a nutshell CEP involves the continuous processing and analysis of
high-volume, high-speed data streams from inside and outside of an
enterprise to detect business-critical issues as they happen in real time.

Contrast this to the traditional processes involving database systems,
which provide delayed analysis. An example of CEP would be real-time
financial market data analysis and decision process allowing traders or
anyone else to make a decision on the spot based on real time data. Prime
example is a Forex System where on the basis of certain indicators (say
moving averages for the past 14 periods) you make a decision to buy or sell.

CEP software offers two major components: a high-level language for
programmers to easily describe how to process the streams/messages, and an
infrastructure engine for processing and analyzing high-volume data.

Although CEP software performs different functions, the component structure
is somewhat analogous to database software, where there is a language (say
SQL) and an engine (the database server). The objective of buying a CEP
product is to save on the development cycle traditionally done by in-house
developers.

Your points:

Is CEP only for (real time) stream data and not for "history" data?

Generally real time, as you make decisions on the spot, say buy or sell in
Forex or algorithmic trading. From a business point of view that is the most
valuable. You can of course store data of interest in a DW like Hive for
historical analysis (trend analytics).

2) Is it possible to search "backward" (upstream) by CEP with given time
window? If a start time of the time window is earlier than the current
stream time.

I have not tried this, but I believe it is feasible using something like a
Kafka broker, which retains events for a period of time. However, I don't see
much business value in it.

3) Do you know any good tools or softwares for "CEP's" using for log data ?

What is the definition of log data here? Are you referring to producers
here?

4) Do you know any good (scientific) papers i should read about CEP ?

There are many references on the web, as Mario pointed out. However, as a
student you need to grasp the need for Complex Event Processing, which is
rooted in Event-Driven Architecture. A good classic book on this is "The Power
of Events" by David Luckham, ISBN 0-201-72789-7.

In general, the architecture I am aiming for is Kafka, ZooKeeper and Spark
Streaming.
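
For illustration, a minimal sketch of that kind of pipeline with the Spark
1.6 direct Kafka stream (the broker list, topic name and window sizes are
placeholders, and the events are assumed to arrive as plain strings):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("CEPPrototype")
val ssc  = new StreamingContext(conf, Seconds(2))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
val ticks = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("prices"))                               // placeholder topic

// e.g. inspect a sliding window of the last 28 seconds, every 2 seconds
ticks.map(_._2)
  .window(Seconds(28), Seconds(2))
  .count()
  .print()

ssc.start()
ssc.awaitTermination()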


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 April 2016 at 11:02, Mario Ds Briggs  wrote:

> Wikipedia defines the goal of CEP as 'respond to them (events) as quickly
> as possible' . So i think there is an direct link to 'streaming', when we
> say CEP.
>
> However pattern matching at its core applies even on historical data.
>
>
> thanks
> Mario
>
>
> - Message from Esa Heikkinen on Wed, 27 Apr 2016 12:40:52 +0300 -
>
> To: user@spark.apache.org
>
> Subject: Re: Spark support for Complex Event Processing (CEP)
>
> Hi
>
> I have followed with interest the discussion about CEP and Spark. It is
> quite close to my research, which is a complex analyzing for log files
> and "history" data  (not actually for real time streams).
>
> I have few questions:
>
> 1) Is CEP only for (real time) stream data and not for "history" data?
>
> 2) Is it possible to search "backward" (upstream) by CEP with given time
> window? If a start time of the time window is earlier than the current
> stream time.
>
> 3) Do you know any good tools or softwares for "CEP's" using for log data ?
>
> 4) Do you know any good (scientific) papers i should read about CEP ?
>
>
> Regards
> PhD student at Tampere University of Technology, Finland, www.tut.fi
> Esa Heikkinen
>
>
>
>
>


Re: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Mich Talebzadeh
please see my other reply

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 27 April 2016 at 10:40, Esa Heikkinen 
wrote:

> Hi
>
> I have followed with interest the discussion about CEP and Spark. It is
> quite close to my research, which is a complex analyzing for log files and
> "history" data  (not actually for real time streams).
>
> I have few questions:
>
> 1) Is CEP only for (real time) stream data and not for "history" data?
>
> 2) Is it possible to search "backward" (upstream) by CEP with given time
> window? If a start time of the time window is earlier than the current
> stream time.
>
> 3) Do you know any good tools or softwares for "CEP's" using for log data ?
>
> 4) Do you know any good (scientific) papers i should read about CEP ?
>
>
> Regards
> PhD student at Tampere University of Technology, Finland, www.tut.fi
> Esa Heikkinen
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: n

2016-04-27 Thread shengshanzhang
Thanks a lot. I added a spark-sql dependency to build.sbt, as the line below shows.

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.6.1"
> On 27 April 2016, at 16:58, ramesh reddy wrote:
> 
> Spark Sql jar has to be added as a dependency in build.sbt.
> 
> 
> On Wednesday, 27 April 2016 1:57 PM, shengshanzhang 
>  wrote:
> 
> 
> Hello :
> my code is as follows:
> ---
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> 
> case class Record(key: Int, value: String)
> object RDDRelation {
> def main(args: Array[String]) {
> 
> val sparkConf = new SparkConf().setAppName("RDDRelation")
> val sc = new SparkContext(sparkConf)
> //val sqlContext = new SQLContext(sc)
> }
> }
> ——
> when I run "sbt package”, i come to a error as follows1
> 
> $ sbt package
> [info] Set current project to Simple Project (in build 
> file:/data/users/zhangshengshan/spark_work/)
> [info] Compiling 1 Scala source to 
> /data/users/zhangshengshan/spark_work/target/scala-2.10/classes...
> [error] 
> /data/users/zhangshengshan/spark_work/src/main/scala/SimpleApp.scala:2: 
> object sql is not a member of package org.apache.spark
> [error] import org.apache.spark.sql.SQLContext
> [error]^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
> [error] Total time: 3 s, completed Apr 27, 2016 4:20:37 PM
> 
> 
> 
> who can tell me how can i fix this problem
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: n

2016-04-27 Thread shengshanzhang
Your build.sbt seems a little complex. Thank you a lot.

The example on the official Spark website explains how to use Spark SQL from
spark-shell, but there are no instructions on how to write a self-contained
application. For a learner who is not familiar with Scala or Java, it is a bit
difficult.
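
For reference, a minimal self-contained sketch that compiles against the
build.sbt below (spark-core + spark-sql 1.6.1) and can be submitted with
spark-submit:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(key: Int, value: String)

object RDDRelation {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("RDDRelation")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // needed for toDF()

    // turn a local collection into a DataFrame and run a SQL query on it
    val df = sc.parallelize(1 to 10).map(i => Record(i, "val_" + i)).toDF()
    df.registerTempTable("records")
    sqlContext.sql("SELECT key, value FROM records WHERE key < 5").show()

    sc.stop()
  }
}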
 




name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.6.1"
> On 27 April 2016, at 16:50, Marco Mistroni wrote:
> 
> Hi
>  please share your build.sbt
> here's mine for reference (using Spark 1.6.1 + scala 2.10)  (pls ignore extra 
> stuff i have added for assembly and logging)
> 
> // Set the project name to the string 'My Project'
> name := "SparkExamples"
> 
> // The := method used in Name and Version is one of two fundamental methods.
> // The other method is <<=
> // All other initialization methods are implemented in terms of these.
> version := "1.0"
> 
> scalaVersion := "2.10.5"
> 
> assemblyJarName in assembly := "sparkexamples.jar"
> 
> 
> // Add a single dependency
> libraryDependencies += "junit" % "junit" % "4.8" % "test"
> libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5"
> libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" % "1.7.5",
> "org.slf4j" % "slf4j-simple" % "1.7.5",
> "org.clapper" %% "grizzled-slf4j" % "1.0.2")
> libraryDependencies += "org.powermock" % "powermock-mockito-release-full" % 
> "1.5.4" % "test"
> libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.1" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming"   % "1.6.1" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.6.1"  % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   % 
> "1.3.0"  % "provided"
> 
> resolvers += "softprops-maven" at 
> "http://dl.bintray.com/content/softprops/maven 
> "
> 
> kr
>  marco
> 
> 
> On Wed, Apr 27, 2016 at 9:27 AM, shengshanzhang  > wrote:
> Hello :
> my code is as follows:
> ---
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> 
> case class Record(key: Int, value: String)
> object RDDRelation {
> def main(args: Array[String]) {
> 
> val sparkConf = new SparkConf().setAppName("RDDRelation")
> val sc = new SparkContext(sparkConf)
> //val sqlContext = new SQLContext(sc)
> }
> }
> ——
> when I run "sbt package”, i come to a error as follows1
> 
> $ sbt package
> [info] Set current project to Simple Project (in build 
> file:/data/users/zhangshengshan/spark_work/)
> [info] Compiling 1 Scala source to 
> /data/users/zhangshengshan/spark_work/target/scala-2.10/classes...
> [error] 
> /data/users/zhangshengshan/spark_work/src/main/scala/SimpleApp.scala:2: 
> object sql is not a member of package org.apache.spark
> [error] import org.apache.spark.sql.SQLContext
> [error] ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
> [error] Total time: 3 s, completed Apr 27, 2016 4:20:37 PM
> 
> 
> 
>  who can tell me how can i fix this problem
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: n

2016-04-27 Thread Marco Mistroni
Hi
 please share your build.sbt
here's mine for reference (using Spark 1.6.1 + scala 2.10)  (pls ignore
extra stuff i have added for assembly and logging)

// Set the project name to the string 'My Project'
name := "SparkExamples"

// The := method used in Name and Version is one of two fundamental methods.
// The other method is <<=
// All other initialization methods are implemented in terms of these.
version := "1.0"

scalaVersion := "2.10.5"

assemblyJarName in assembly := "sparkexamples.jar"


// Add a single dependency
libraryDependencies += "junit" % "junit" % "4.8" % "test"
libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5"
libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" % "1.7.5",
"org.slf4j" % "slf4j-simple" % "1.7.5",
"org.clapper" %% "grizzled-slf4j" % "1.0.2")
libraryDependencies += "org.powermock" % "powermock-mockito-release-full" %
"1.5.4" % "test"
libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.1" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming"   % "1.6.1"
% "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.6.1"  %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
"1.3.0"  % "provided"

resolvers += "softprops-maven" at
"http://dl.bintray.com/content/softprops/maven"

kr
 marco


On Wed, Apr 27, 2016 at 9:27 AM, shengshanzhang 
wrote:

> Hello :
> my code is as follows:
> ---
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
>
> case class Record(key: Int, value: String)
> object RDDRelation {
> def main(args: Array[String]) {
>
> val sparkConf = new SparkConf().setAppName("RDDRelation")
> val sc = new SparkContext(sparkConf)
> //val sqlContext = new SQLContext(sc)
> }
> }
> ——
> when I run "sbt package”, i come to a error as follows1
>
> $ sbt package
> [info] Set current project to Simple Project (in build
> file:/data/users/zhangshengshan/spark_work/)
> [info] Compiling 1 Scala source to
> /data/users/zhangshengshan/spark_work/target/scala-2.10/classes...
> [error]
> /data/users/zhangshengshan/spark_work/src/main/scala/SimpleApp.scala:2:
> object sql is not a member of package org.apache.spark
> [error] import org.apache.spark.sql.SQLContext
> [error] ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
> [error] Total time: 3 s, completed Apr 27, 2016 4:20:37 PM
>
>
>
>  who can tell me how can i fix this problem
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread shengshanzhang
Hi,

On the Spark website, there is code as follows showing how to create
datasets.

However when i input this line into spark-shell,there comes a Error, 
and who can tell me Why and how to fix this?

scala> val ds = Seq(1, 2, 3).toDS()
:35: error: value toDS is not a member of Seq[Int]
   val ds = Seq(1, 2, 3).toDS()



Thank you a lot!

Re: Save DataFrame to HBase

2016-04-27 Thread Daniel Haviv
Hi Benjamin,
Yes it should work.

Let me know if you need further assistance I might be able to get the code I've 
used for that project.

Thank you.
Daniel

> On 24 Apr 2016, at 17:35, Benjamin Kim  wrote:
> 
> Hi Daniel,
> 
> How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed which 
> comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
> 
> Thanks,
> Ben
> 
>> On Apr 24, 2016, at 1:43 AM, Daniel Haviv  
>> wrote:
>> 
>> Hi,
>> I tried saving DF to HBase using a hive table with hbase storage handler and 
>> hiveContext but it failed due to a bug.
>> 
>> I was able to persist the DF to hbase using Apache Pheonix which was pretty 
>> simple.
>> 
>> Thank you.
>> Daniel
>> 
>>> On 21 Apr 2016, at 16:52, Benjamin Kim  wrote:
>>> 
>>> Has anyone found an easy way to save a DataFrame into HBase?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
> 


n

2016-04-27 Thread shengshanzhang
Hello :
my code is as follows:
---
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(key: Int, value: String)
object RDDRelation {
def main(args: Array[String]) {

val sparkConf = new SparkConf().setAppName("RDDRelation")
val sc = new SparkContext(sparkConf)
//val sqlContext = new SQLContext(sc)
}
}
——
when I run "sbt package”, i come to a error as follows1

$ sbt package
[info] Set current project to Simple Project (in build 
file:/data/users/zhangshengshan/spark_work/)
[info] Compiling 1 Scala source to 
/data/users/zhangshengshan/spark_work/target/scala-2.10/classes...
[error] /data/users/zhangshengshan/spark_work/src/main/scala/SimpleApp.scala:2: 
object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.SQLContext
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 3 s, completed Apr 27, 2016 4:20:37 PM



 who can tell me how can i fix this problem
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: n

2016-04-27 Thread ramesh reddy
Spark Sql jar has to be added as a dependency in build.sbt. 

On Wednesday, 27 April 2016 1:57 PM, shengshanzhang 
 wrote:
 

 Hello :
    my code is as follows:
---
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Record(key: Int, value: String)
    object RDDRelation {
        def main(args: Array[String]) {

            val sparkConf = new SparkConf().setAppName("RDDRelation")
            val sc = new SparkContext(sparkConf)
            //val sqlContext = new SQLContext(sc)
        }
    }
——
when I run "sbt package”, i come to a error as follows1

$ sbt package
[info] Set current project to Simple Project (in build 
file:/data/users/zhangshengshan/spark_work/)
[info] Compiling 1 Scala source to 
/data/users/zhangshengshan/spark_work/target/scala-2.10/classes...
[error] /data/users/zhangshengshan/spark_work/src/main/scala/SimpleApp.scala:2: 
object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.SQLContext
[error]                        ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 3 s, completed Apr 27, 2016 4:20:37 PM



 who can tell me how can i fix this problem
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


  

Re: n

2016-04-27 Thread shengshanzhang
Thanks a lot. I added a spark-sql dependency to build.sbt, as the line below shows.

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.6.1"
> On 27 April 2016, at 16:58, ramesh reddy wrote:
> 
> Spark Sql jar has to be added as a dependency in build.sbt.
> 
> 
> On Wednesday, 27 April 2016 1:57 PM, shengshanzhang 
> > wrote:
> 
> 
> Hello :
> my code is as follows:
> ---
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> 
> case class Record(key: Int, value: String)
> object RDDRelation {
> def main(args: Array[String]) {
> 
> val sparkConf = new SparkConf().setAppName("RDDRelation")
> val sc = new SparkContext(sparkConf)
> //val sqlContext = new SQLContext(sc)
> }
> }
> ——
> when I run "sbt package”, i come to a error as follows1
> 
> $ sbt package
> [info] Set current project to Simple Project (in build 
> file:/data/users/zhangshengshan/spark_work/)
> [info] Compiling 1 Scala source to 
> /data/users/zhangshengshan/spark_work/target/scala-2.10/classes...
> [error] 
> /data/users/zhangshengshan/spark_work/src/main/scala/SimpleApp.scala:2: 
> object sql is not a member of package org.apache.spark
> [error] import org.apache.spark.sql.SQLContext
> [error]^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
> [error] Total time: 3 s, completed Apr 27, 2016 4:20:37 PM
> 
> 
> 
> who can tell me how can i fix this problem
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Esa Heikkinen

Hi

I have followed with interest the discussion about CEP and Spark. It is
quite close to my research, which is complex analysis of log files
and "history" data (not actually of real-time streams).


I have few questions:

1) Is CEP only for (real time) stream data and not for "history" data?

2) Is it possible to search "backward" (upstream) by CEP with given time 
window? If a start time of the time window is earlier than the current 
stream time.


3) Do you know any good tools or softwares for "CEP's" using for log data ?

4) Do you know any good (scientific) papers i should read about CEP ?


Regards
PhD student at Tampere University of Technology, Finland, www.tut.fi
Esa Heikkinen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel
Sorry sent from wrong email address. 

> Begin forwarded message:
> 
> From: Michael Segel 
> Subject: Re: Spark support for Complex Event Processing (CEP)
> Date: April 27, 2016 at 7:51:14 AM CDT
> To: Mich Talebzadeh 
> Cc: Esa Heikkinen , "user @spark" 
> 
> 
> Spark and CEP? It depends… 
> 
> Ok, I know that’s not the answer you want to hear, but its a bit more 
> complicated… 
> 
> If you consider Spark Streaming, you have some issues. 
> Spark Streaming isn’t a Real Time solution because it is a micro batch 
> solution. The smallest Window is 500ms.  This means that if your compute time 
> is >> 500ms and/or  your event flow is >> 500ms this could work.
> (e.g. 'real time' image processing on a system that is capturing 60FPS 
> because the processing time is >> 500ms. ) 
> 
> So Spark Streaming wouldn’t be the best solution…. 
> 
> However, you can combine spark with other technologies like Storm, Akka, etc 
> .. where you have continuous streaming. 
> So you could instantiate a spark context per worker in storm… 
> 
> I think if there are no class collisions between Akka and Spark, you could 
> use Akka, which may have a better potential for communication between 
> workers. 
> So here you can handle CEP events. 
> 
> HTH
> 
>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh > > wrote:
>> 
>> please see my other reply
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 27 April 2016 at 10:40, Esa Heikkinen > > wrote:
>> Hi
>> 
>> I have followed with interest the discussion about CEP and Spark. It is 
>> quite close to my research, which is a complex analyzing for log files and 
>> "history" data  (not actually for real time streams).
>> 
>> I have few questions:
>> 
>> 1) Is CEP only for (real time) stream data and not for "history" data?
>> 
>> 2) Is it possible to search "backward" (upstream) by CEP with given time 
>> window? If a start time of the time window is earlier than the current 
>> stream time.
>> 
>> 3) Do you know any good tools or softwares for "CEP's" using for log data ?
>> 
>> 4) Do you know any good (scientific) papers i should read about CEP ?
>> 
>> 
>> Regards
>> PhD student at Tampere University of Technology, Finland, www.tut.fi 
>> 
>> Esa Heikkinen
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> 
>> 
>> 
> 
> The opinions expressed here are mine, while they may reflect a cognitive 
> thought, that is purely accidental. 
> Use at your own risk. 
> Michael Segel
> michael_segel (AT) hotmail.com 
> 
> 
> 
> 
> 




unsubscribe

2016-04-27 Thread Harjit Singh










unsubscribe

2016-04-27 Thread Burger, Robert


Robert Burger | Solutions Design IT Specialist | CBAW TS | TD Wealth Technology 
Solutions
79 Wellington Street West, 17th Floor, TD South Tower, Toronto, ON, M5K 1A2


If you wish to unsubscribe from receiving commercial electronic messages from 
TD Bank Group, please click here or go to the following web address: 
www.td.com/tdoptout

NOTICE: Confidential message which may be privileged. Unauthorized 
use/disclosure prohibited. If received in error, please go to www.td.com/legal 
for instructions.


Spark Metrics for Ganglia

2016-04-27 Thread Khalid Latif

I built Apache Spark on Ubuntu 14.04 LTS with the following command:

mvn -Pspark-ganglia-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 
-DskipTests clean package


Build was successful. Then, following modifications were made.

1. Added "SPARK_LOCAL_IP=127.0.0.1" to the file
$SPARK_HOME/conf/spark-env.sh to avoid the following warnings:


16/04/27 17:45:54 WARN Utils: Your hostname, ganglia resolves 
to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface 
eth0)
16/04/27 17:45:54 WARN Utils: Set SPARK_LOCAL_IP if you need to 
bind to another address


After that, Spark started well without these warnings.

2. To enable Ganglia metrics, the following lines were added to the
file $SPARK_HOME/conf/metrics.properties:


# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name=SparkCluster
*.sink.ganglia.host=XYZ.XYZ.XYZ.XYZ (Replaced by real IP)
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=multicast

The following errors were displayed, but Spark still started:

16/04/27 17:45:59 ERROR MetricsSystem: Sink class 
org.apache.spark.metrics.sink.GangliaSink  cannot be instantiated
16/04/27 17:45:59 ERROR SparkContext: Error initializing 
SparkContext.
java.lang.ClassNotFoundException: 
org.apache.spark.metrics.sink.GangliaSink
at 
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)

at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
...


"GangliaSink" class can be found at: 
$SPARK_HOME/external/spark-ganglia-lgpl/target/classes/org/apache/spark/metrics/sink/GangliaSink.class


I can see previous threads regarding the same problem but I cannot find 
any solution. Any idea?






Re: executor delay in Spark

2016-04-27 Thread Mike Hynes
Hi Raghava,

I'm terribly sorry about the end of my last email; that garbled
sentence was garbled because it wasn't meant to exist; I wrote it on
my phone, realized I wouldn't realistically have time to look into
another set of logs deeply enough, and then mistook myself for having
deleted it. Again, I'm very sorry for my error here.

I did peek at your code, though, and think you could try the following:
0. The actions in your main method are many, and it will be hard to
isolate a problem; I would recommend only examining *one* RDD at first,
rather than six.
1. There is a lot of repetition for reading RDDs from text files
sequentially; if you put those lines into two methods depending on RDD
type, you will at least have one entry point to work with once you
make a simplified test program (see the sketch after this list).
2. In one part you persist, count, immediately unpersist, and then
count an RDD again. I'm not acquainted with this idiom, and I don't
understand what it is meant to achieve. It strikes me as suspect for
triggering unusual garbage collection, which would, I think, only
complicate your trace debugging.
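
To make point 1 concrete, here is a sketch of the kind of single entry point
I mean (the parsing function and storage level are placeholders; adapt them to
whatever your current code does):

import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// one place for the repeated "read, parse, hash-partition, persist" pattern
def loadPairRDD(sc: SparkContext, path: String, numPartitions: Int)
               (parse: String => (Int, Int)): RDD[(Int, Int)] = {
  sc.textFile(path)
    .map(parse)
    .partitionBy(new HashPartitioner(numPartitions))
    .persist(StorageLevel.MEMORY_AND_DISK)
}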

I've attached a python script that dumps relevant info from the Spark
JSON logs into a CSV for easier analysis in your language of choice;
hopefully it can aid in finer-grained debugging (the headers of the
fields it prints are listed in one of the functions).

Mike

On 4/25/16, Raghava Mutharaju  wrote:
> Mike,
>
> We ran our program with 16, 32 and 64 partitions. The behavior was same as
> before with 8 partitions. It was mixed -- for some RDDs we see an
> all-nothing skew, but for others we see them getting split across the 2
> worker nodes. In some cases, they start with even split and in later
> iterations it goes to all-nothing split. Please find the logs attached.
>
> our program source code:
> https://github.com/raghavam/sparkel/blob/275ecbb901a58592d8a70a8568dd95c839d46ecc/src/main/scala/org/daselab/sparkel/SparkELDAGAnalysis.scala
>
> We put in persist() statements for different RDDs just to check their skew.
>
> @Jeff, setting minRegisteredResourcesRatio did not help. Behavior was same
> as before.
>
> Thank you for your time.
>
> Regards,
> Raghava.
>
>
> On Sun, Apr 24, 2016 at 7:17 PM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Could you change numPartitions to {16, 32, 64} and run your program for
>> each to see how many partitions are allocated to each worker? Let's see
>> if
>> you experience an all-nothing imbalance that way; if so, my guess is that
>> something else is odd in your program logic or spark runtime environment,
>> but if not and your executors all receive at least *some* partitions,
>> then
>> I still wouldn't rule out effects of scheduling delay. It's a simple
>> test,
>> but it could give some insight.
>>
>> Mike
>>
>> his could still be a  scheduling  If only one has *all* partitions,  and
>> email me the log file? (If it's 10+ MB, just the first few thousand lines
>> are fine).
>> On Apr 25, 2016 7:07 AM, "Raghava Mutharaju" 
>> wrote:
>>
>>> Mike, All,
>>>
>>> It turns out that the second time we encountered the uneven-partition
>>> issue is not due to spark-submit. It was resolved with the change in
>>> placement of count().
>>>
>>> Case-1:
>>>
>>> val numPartitions = 8
>>> // read uAxioms from HDFS, use hash partitioner on it and persist it
>>> // read type1Axioms from HDFS, use hash partitioner on it and persist it
>>> currDeltaURule1 = type1Axioms.join(uAxioms)
>>>  .values
>>>  .distinct(numPartitions)
>>>  .partitionBy(hashPartitioner)
>>> currDeltaURule1 = currDeltaURule1.setName("deltaURule1_" + loopCounter)
>>>
>>>  .persist(StorageLevel.MEMORY_AND_DISK)
>>>.count()
>>>
>>> 
>>>
>>> currDeltaURule1 RDD results in all the data on one node (there are 2
>>> worker nodes). If we move count(), the uneven partition issue is
>>> resolved.
>>>
>>> Case-2:
>>>
>>> currDeltaURule1 = currDeltaURule1.setName("deltaURule1_" + loopCounter)
>>>
>>>  .persist(StorageLevel.MEMORY_AND_DISK)
>>>
>>>
>>> 
>>>
>>>  -- this rdd depends on currDeltaURule1 and it gets
>>> executed. This resolved the uneven partitioning issue.
>>>
>>> I don't see why the moving of an action to a later part in the code
>>> would
>>> affect the partitioning. Are there other factors at play here that
>>> affect
>>> the partitioning?
>>>
>>> (Inconsistent) uneven partitioning leads to one machine getting over
>>> burdened (memory and number of tasks). We see a clear improvement in
>>> runtime when the partitioning is even (happens when count is moved).
>>>
>>> Any pointers in figuring out this issue is much appreciated.
>>>
>>> Regards,
>>> Raghava.
>>>
>>>
>>>
>>>
>>> On Fri, Apr 22, 2016 at 7:40 PM, Mike Hynes <91m...@gmail.com> wrote:
>>>
 Glad to hear that the problem was solvable! I have not seen