Join operation on DStreams

2015-09-21 Thread guoxu1231
Hi Spark Experts, 

I'm trying to use join(otherStream, [numTasks]) on DStreams, and it
requires being called on two DStreams of (K, V) and (K, W) pairs.

With a regular RDD, we could use keyBy(f) to build the (K, V) pairs;
however, I could not find an equivalent on DStream.

My question is:
What is the expected way to build (K, V) pair in DStream?


Thanks
Shawn
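For reference, the usual approach is a plain map on the DStream (the DStream
equivalent of RDD.keyBy). A minimal sketch, assuming a hypothetical
DStream[String] called "lines" of comma-separated "key,value" records (the
names below are placeholders, not from this thread):

import org.apache.spark.streaming.dstream.DStream

val pairs: DStream[(String, String)] = lines.map { line =>
  val fields = line.split(",")
  (fields(0), line)   // key by the first field, keep the whole record as the value
}
// pairs can now be joined with another DStream of (String, W) pairs via pairs.join(other)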







RE: SparkR package path

2015-09-21 Thread Sun, Rui
Hossein,

Is there any strong reason to download and install the SparkR source package
separately from the Spark distribution?
An R user can simply download the Spark distribution, which contains the SparkR
source and binary package, and use sparkR directly. There is no need to install
the SparkR package at all.

From: Hossein [mailto:fal...@gmail.com]
Sent: Tuesday, September 22, 2015 9:19 AM
To: dev@spark.apache.org
Subject: SparkR package path

Hi dev list,

SparkR backend assumes SparkR source files are located under 
"SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh. 
This setting makes sense for Spark developers, but if an R user downloads and 
installs the SparkR source package, the source files are going to be placed in 
different locations.

In the R runtime it is easy to find the location of package files using 
path.package("SparkR"). But we need to make some changes to the R backend and/or 
spark-submit so that the JVM process learns the location of worker.R, daemon.R, 
and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
Yeah, whoever is maintaining the scripts and snapshot builds has fallen
down on the job -- but there is nothing preventing you from checking out
branch-1.5 and creating your own build, which is arguably a smarter thing
to do anyway.  If I'm going to use a non-release build, then I want the
full git commit history of exactly what is in that build readily available,
not just somewhat arbitrary JARs.

On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:

> But I cannot find 1.5.1-SNAPSHOT either at
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>
> On Tue, Sep 22, 2015 at 12:55 PM, Mark Hamstra wrote:
>
>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The
>> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
>> candidates and then the 1.5.1 release.
>>
>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>>
>>> I'd like to use some important bug fixes in 1.5 branch and I look for
>>> the apache maven host, but don't find any snapshot for 1.5 branch.
>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>
>>> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for 1.5.X?
>>>
>>
>>


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Fengdong Yu
Do you mean you want to publish the artifact to your private repository?

if so, please use ‘sbt publish’

and add the following to your build.sbt:

publishTo := {
  val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
  if (version.value.endsWith("SNAPSHOT"))
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "content/repositories/releases")
}



> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
> 
> My project is using sbt (or maven), which need to download dependency from a 
> maven repo. I have my own private maven repo with nexus but I don't know how 
> to push my own build to it, can you give me a hint?
> 
> On Tue, Sep 22, 2015 at 1:25 PM, Mark Hamstra wrote:
> Yeah, whoever is maintaining the scripts and snapshot builds has fallen down 
> on the job -- but there is nothing preventing you from checking out 
> branch-1.5 and creating your own build, which is arguably a smarter thing to 
> do anyway.  If I'm going to use a non-release build, then I want the full git 
> commit history of exactly what is in that build readily available, not just 
> somewhat arbitrary JARs.
> 
> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  > wrote:
> But I cannot find 1.5.1-SNAPSHOT either at 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>  
> 
> On Tue, Sep 22, 2015 at 12:55 PM, Mark Hamstra wrote:
> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The 
> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release 
> candidates and then the 1.5.1 release.
> 
> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  > wrote:
> I'd like to use some important bug fixes in 1.5 branch and I look for the 
> apache maven host, but don't find any snapshot for 1.5 branch. 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>  
> 
> 
> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for 1.5.X?
> 
> 



Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
But I cannot find 1.5.1-SNAPSHOT either at
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/

On Tue, Sep 22, 2015 at 12:55 PM, Mark Hamstra wrote:

> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The
> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
> candidates and then the 1.5.1 release.
>
> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>
>> I'd like to use some important bug fixes in 1.5 branch and I look for the
>> apache maven host, but don't find any snapshot for 1.5 branch.
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>
>> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for 1.5.X?
>>
>
>


Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
I'd like to use some important bug fixes in the 1.5 branch, and I looked at the
Apache Maven host but couldn't find any snapshot for the 1.5 branch.
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/

I can find 1.4.X and 1.6.0 versions, so why is there no snapshot for 1.5.X?


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
My project uses sbt (or maven), which needs to download dependencies from
a maven repo. I have my own private maven repo with Nexus but I don't know
how to push my own build to it. Can you give me a hint?

On Tue, Sep 22, 2015 at 1:25 PM, Mark Hamstra wrote:

> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
> down on the job -- but there is nothing preventing you from checking out
> branch-1.5 and creating your own build, which is arguably a smarter thing
> to do anyway.  If I'm going to use a non-release build, then I want the
> full git commit history of exactly what is in that build readily available,
> not just somewhat arbitrary JARs.
>
> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:
>
>> But I cannot find 1.5.1-SNAPSHOT either at
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>>
>> On Tue, Sep 22, 2015 at 12:55 PM, Mark Hamstra wrote:
>>
>>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The
>>> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
>>> candidates and then the 1.5.1 release.
>>>
>>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>>>
 I'd like to use some important bug fixes in 1.5 branch and I look for
 the apache maven host, but don't find any snapshot for 1.5 branch.
 https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/

 I can find 1.4.X and 1.6.0 versions, why there is no snapshot for 1.5.X?

>>>
>>>
>


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
However, I found some scripts in dev/audit-release; can I use them?

On Tue, Sep 22, 2015 at 1:34 PM, Bin Wang wrote:

> No, I mean push spark to my private repository. Spark don't have a
> build.sbt as far as I see.
>
> On Tue, Sep 22, 2015 at 1:29 PM, Fengdong Yu wrote:
>
>> Do you mean you want to publish the artifact to your private repository?
>>
>> if so, please use ‘sbt publish’
>>
>> and add the following to your build.sbt:
>>
>> publishTo := {
>>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>>   if (version.value.endsWith("SNAPSHOT"))
>>     Some("snapshots" at nexus + "content/repositories/snapshots")
>>   else
>>     Some("releases" at nexus + "content/repositories/releases")
>> }
>>
>>
>>
>> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
>>
>> My project is using sbt (or maven), which need to download dependency
>> from a maven repo. I have my own private maven repo with nexus but I don't
>> know how to push my own build to it, can you give me a hint?
>>
>> On Tue, Sep 22, 2015 at 1:25 PM, Mark Hamstra wrote:
>>
>>> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
>>> down on the job -- but there is nothing preventing you from checking out
>>> branch-1.5 and creating your own build, which is arguably a smarter thing
>>> to do anyway.  If I'm going to use a non-release build, then I want the
>>> full git commit history of exactly what is in that build readily available,
>>> not just somewhat arbitrary JARs.
>>>
>>> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:
>>>
 But I cannot find 1.5.1-SNAPSHOT either at
 https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/

 On Tue, Sep 22, 2015 at 12:55 PM, Mark Hamstra wrote:

> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1
> release candidates and then the 1.5.1 release.
>
> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:
>
>> I'd like to use some important bug fixes in 1.5 branch and I look for
>> the apache maven host, but don't find any snapshot for 1.5 branch.
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>
>> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for
>> 1.5.X?
>>
>
>
>>>
>>


Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Reynold Xin
What's the plan if you run explain?

In 1.5 the default should be TungstenAggregate, which does spill (switching
from hash-based aggregation to sort-based aggregation).
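For example, a rough sketch of checking the physical plan (the path and column
names below are placeholders, not from this thread):

import org.apache.spark.sql.functions.sum

val df = sqlContext.read.parquet("/path/to/data")
df.groupBy("key").agg(sum("value")).explain()
// On 1.5 the printed physical plan should show TungstenAggregate rather than
// the older hash-based Aggregate operator.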

On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah  wrote:

> Hi everyone,
>
> I’m debugging some slowness and apparent memory pressure + GC issues after
> I ported some workflows from raw RDDs to Data Frames. In particular, I’m
> looking into an aggregation workflow that computes many aggregations per
> key at once.
>
> My workflow before was doing a fairly straightforward combineByKey call
> where the aggregation would build up non-trivially sized objects in memory
> – I was computing numerous sums over various fields of the data at a time.
> In particular, I noticed that there was spilling to disk on the map side of
> the aggregation.
>
> When I switched to using DataFrames aggregation – particularly
> DataFrame.groupBy(some list of keys).agg(exprs) where I passed a large
> number of “Sum” exprs - the execution began to choke. I saw one of my
> executors had a long GC pause and my job isn’t able to recover. However
> when I reduced the number of Sum expressions being computed in the
> aggregation, the workflow started to work normally.
>
> I have a hypothesis that I would like to run by everyone. In
> org.apache.spark.sql.execution.Aggregate.scala, branch-1.5, I’m looking at
> the execution of Aggregation
> ,
> which appears to build the aggregation result in memory via updating a
> HashMap and iterating over the rows. However this appears to be less robust
> than what would happen if PairRDDFunctions.combineByKey were to be used. If
> combineByKey were used, then instead of using two mapPartitions calls
> (assuming the aggregation is partially-computable, as Sum is), it would use
> the ExternalSorter and ExternalAppendOnlyMap objects to compute the
> aggregation. This would allow the aggregated result to grow large as some
> of the aggregated result could be spilled to disk. This especially seems
> bad if the aggregation reduction factor is low; that is, if there are many
> unique keys in the dataset. In particular, the Hash Map is holding O(# of
> keys * number of aggregated results per key) items in memory at a time.
>
> I was wondering what everyone’s thought on this problem is. Did we
> consciously think about the robustness implications when choosing to use an
> in memory Hash Map to compute the aggregation? Is this an inherent
> limitation of the aggregation implementation in Data Frames?
>
> Thanks,
>
> -Matt Cheah
>
>
>
>
>
>


Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Can I use this sparkContext on the executors?
In my application, I have a scenario of reading from a DB for certain records
in an RDD, hence I need the sparkContext to read from the DB (Cassandra in our
case).

If the sparkContext can't be sent to the executors, what is the workaround for
this?

On Mon, Sep 21, 2015 at 3:06 PM, Petr Novak  wrote:

> add @transient?
>
> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch 
> wrote:
>
>> Hello All,
>>
>> How can i pass sparkContext as a parameter to a method in an object.
>> Because passing sparkContext is giving me TaskNotSerializable Exception.
>>
>> How can i achieve this ?
>>
>> Thanks,
>> Padma Ch
>>
>
>


Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
sparkContext is available on the driver, not on the executors.

To read from Cassandra, you can use something like this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
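A minimal sketch based on the linked connector docs (the keyspace, table, and
column names are placeholders):

import com.datastax.spark.connector._

// cassandraTable is added to SparkContext by the connector's implicits
val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.map(row => row.getString("some_column")).take(10)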

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Mon, Sep 21, 2015 at 2:27 PM, Priya Ch 
wrote:

> can i use this sparkContext on executors ??
> In my application, i have scenario of reading from db for certain records
> in rdd. Hence I need sparkContext to read from DB (cassandra in our case),
>
> If sparkContext couldn't be sent to executors , what is the workaround for
> this ??
>
> On Mon, Sep 21, 2015 at 3:06 PM, Petr Novak  wrote:
>
>> add @transient?
>>
>> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch 
>> wrote:
>>
>>> Hello All,
>>>
>>> How can i pass sparkContext as a parameter to a method in an object.
>>> Because passing sparkContext is giving me TaskNotSerializable Exception.
>>>
>>> How can i achieve this ?
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>
>>
>


Re: passing SparkContext as parameter

2015-09-21 Thread Ted Yu
You can use a broadcast variable for passing connection information.
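A rough sketch of the idea (the case class and field names below are made up
for illustration):

case class CassandraConf(host: String, port: Int)

// broadcast the connection settings once from the driver
val connInfo = sc.broadcast(CassandraConf("cassandra-host", 9042))

rdd.mapPartitions { iter =>
  val conf = connInfo.value   // readable on the executors
  // open a connection using conf here, look up the records, then close the connection
  iter
}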

Cheers

> On Sep 21, 2015, at 4:27 AM, Priya Ch  wrote:
> 
> can i use this sparkContext on executors ??
> In my application, i have scenario of reading from db for certain records in 
> rdd. Hence I need sparkContext to read from DB (cassandra in our case),
> 
> If sparkContext couldn't be sent to executors , what is the workaround for 
> this ??
> 
>> On Mon, Sep 21, 2015 at 3:06 PM, Petr Novak  wrote:
>> add @transient?
>> 
>>> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch  
>>> wrote:
>>> Hello All,
>>> 
>>> How can i pass sparkContext as a parameter to a method in an object. 
>>> Because passing sparkContext is giving me TaskNotSerializable Exception.
>>> 
>>> How can i achieve this ?
>>> 
>>> Thanks,
>>> Padma Ch
> 


Forecasting Library For Apache Spark

2015-09-21 Thread Mohamed Baddar
Hello everybody, this is my first mail on the list, and I would like to
introduce myself first :)
My name is Mohamed Baddar. I work as a Big Data and Analytics Software
Engineer at BADRIT (http://badrit.com/), a software startup with a focus on
Big Data. I have also been working for 6+ years at IBM R Egypt, in the HPC,
Big Data and Analytics area.

I just have a question: I can't find a supported Apache Spark library for
forecasting using ARIMA, ETS, Bayesian models or any other method. Are there
any plans for such a development? I can't find any issue talking about it.
Is anyone interested in having/developing a related module? I find it a
critical feature to be added to Spark.

Thanks


Re: Forecasting Library For Apache Spark

2015-09-21 Thread Mohamed Baddar
Thanks Corey for the suggestion , will check it

On Mon, Sep 21, 2015 at 2:43 PM, Corey Nolet  wrote:

> Mohamed,
>
> Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
> was added to this recently and seasonal ARIMA should be following shortly.
>
> [1] https://github.com/cloudera/spark-timeseries
>
> On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar  > wrote:
>
>> Hello everybody , this my first mail in the List , and i would like to
>> introduce my self first :)
>> My Name is Mohamed baddar , I work as Big Data and Analytics Software
>> Engieer at BADRIT (http://badrit.com/) , a software Startup with focus
>> in Big Data , also i have been working for 6+ years at IBM R Egypt , in
>> HPC , Big Data and Analytics Are
>>
>> I just have a question , i can't find supported Apache Spark library for
>> forecasting using ARIMA , ETS , Bayesian model or any method , is there any
>> plans for such a development , as i can't find any issue talking about it ,
>> is any one interested to have/develop a related module , as i find it a
>> critical feature to be added to SPARK
>>
>> Thanks
>>
>
>


Re: Forecasting Library For Apache Spark

2015-09-21 Thread Corey Nolet
Mohamed,

Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
was added to this recently and seasonal ARIMA should be following shortly.

[1] https://github.com/cloudera/spark-timeseries

On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar 
wrote:

> Hello everybody , this my first mail in the List , and i would like to
> introduce my self first :)
> My Name is Mohamed baddar , I work as Big Data and Analytics Software
> Engieer at BADRIT (http://badrit.com/) , a software Startup with focus in
> Big Data , also i have been working for 6+ years at IBM R Egypt , in HPC
> , Big Data and Analytics Are
>
> I just have a question , i can't find supported Apache Spark library for
> forecasting using ARIMA , ETS , Bayesian model or any method , is there any
> plans for such a development , as i can't find any issue talking about it ,
> is any one interested to have/develop a related module , as i find it a
> critical feature to be added to SPARK
>
> Thanks
>


Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
I was executing on Spark 1.4 so I didn’t notice the Tungsten option would
make spilling happen in 1.5. I’ll upgrade to 1.5 and see how that turns out.
Thanks!

From:  Reynold Xin 
Date:  Monday, September 21, 2015 at 5:36 PM
To:  Matt Cheah 
Cc:  "dev@spark.apache.org" , Mingyu Kim
, Peter Faiman 
Subject:  Re: DataFrames Aggregate does not spill?

What's the plan if you run explain?

In 1.5 the default should be TungstenAggregate, which does spill (switching
from hash-based aggregation to sort-based aggregation).

On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah  wrote:
> Hi everyone,
> 
> I’m debugging some slowness and apparent memory pressure + GC issues after I
> ported some workflows from raw RDDs to Data Frames. In particular, I’m looking
> into an aggregation workflow that computes many aggregations per key at once.
> 
> My workflow before was doing a fairly straightforward combineByKey call where
> the aggregation would build up non-trivially sized objects in memory – I was
> computing numerous sums over various fields of the data at a time. In
> particular, I noticed that there was spilling to disk on the map side of the
> aggregation.
> 
> When I switched to using DataFrames aggregation – particularly
> DataFrame.groupBy(some list of keys).agg(exprs) where I passed a large number
> of “Sum” exprs - the execution began to choke. I saw one of my executors had a
> long GC pause and my job isn’t able to recover. However when I reduced the
> number of Sum expressions being computed in the aggregation, the workflow
> started to work normally.
> 
> I have a hypothesis that I would like to run by everyone. In
> org.apache.spark.sql.execution.Aggregate.scala, branch-1.5, I’m looking at the
> execution of Aggregation
> (sql/core/src/main/scala/org/apache/spark/sql/execution/Aggregate.scala),
> which appears to build
> the aggregation result in memory via updating a HashMap and iterating over the
> rows. However this appears to be less robust than what would happen if
> PairRDDFunctions.combineByKey were to be used. If combineByKey were used, then
> instead of using two mapPartitions calls (assuming the aggregation is
> partially-computable, as Sum is), it would use the ExternalSorter and
> ExternalAppendOnlyMap objects to compute the aggregation. This would allow the
> aggregated result to grow large as some of the aggregated result could be
> spilled to disk. This especially seems bad if the aggregation reduction factor
> is low; that is, if there are many unique keys in the dataset. In particular,
> the Hash Map is holding O(# of keys * number of aggregated results per key)
> items in memory at a time.
> 
> I was wondering what everyone’s thought on this problem is. Did we consciously
> think about the robustness implications when choosing to use an in memory Hash
> Map to compute the aggregation? Is this an inherent limitation of the
> aggregation implementation in Data Frames?
> 
> Thanks,
> 
> -Matt Cheah
> 
> 
> 
> 
> 







Re: How to modify Hadoop APIs used by Spark?

2015-09-21 Thread Dogtail L
Oh, I want to modify existing Hadoop InputFormat.

On Mon, Sep 21, 2015 at 4:23 PM, Ted Yu  wrote:

> Can you clarify what you want to do:
> If you modify existing hadoop InputFormat, etc, it would be a matter of
> rebuilding hadoop and build Spark using the custom built hadoop as
> dependency.
>
> Do you introduce new InputFormat ?
>
> Cheers
>
> On Mon, Sep 21, 2015 at 1:20 PM, Dogtail Ray 
> wrote:
>
>> Hi all,
>>
>> I find that Spark uses some Hadoop APIs such as InputFormat, InputSplit,
>> etc., and I want to modify these Hadoop APIs. Do you know how can I
>> integrate my modified Hadoop code into Spark? Great thanks!
>>
>>
>


DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
Hi everyone,

I’m debugging some slowness and apparent memory pressure + GC issues after I
ported some workflows from raw RDDs to Data Frames. In particular, I’m
looking into an aggregation workflow that computes many aggregations per key
at once.

My workflow before was doing a fairly straightforward combineByKey call
where the aggregation would build up non-trivially sized objects in memory –
I was computing numerous sums over various fields of the data at a time. In
particular, I noticed that there was spilling to disk on the map side of the
aggregation.

When I switched to using DataFrames aggregation – particularly
DataFrame.groupBy(some list of keys).agg(exprs) where I passed a large
number of “Sum” exprs - the execution began to choke. I saw one of my
executors had a long GC pause and my job isn’t able to recover. However when
I reduced the number of Sum expressions being computed in the aggregation,
the workflow started to work normally.

I have a hypothesis that I would like to run by everyone. In
org.apache.spark.sql.execution.Aggregate.scala, branch-1.5, I’m looking at
the execution of Aggregation, which appears to build the
aggregation result in memory via updating a HashMap and iterating over the
rows. However this appears to be less robust than what would happen if
PairRDDFunctions.combineByKey were to be used. If combineByKey were used,
then instead of using two mapPartitions calls (assuming the aggregation is
partially-computable, as Sum is), it would use the ExternalSorter and
ExternalAppendOnlyMap objects to compute the aggregation. This would allow
the aggregated result to grow large as some of the aggregated result could
be spilled to disk. This especially seems bad if the aggregation reduction
factor is low; that is, if there are many unique keys in the dataset. In
particular, the Hash Map is holding O(# of keys * number of aggregated
results per key) items in memory at a time.

I was wondering what everyone’s thought on this problem is. Did we
consciously think about the robustness implications when choosing to use an
in memory Hash Map to compute the aggregation? Is this an inherent
limitation of the aggregation implementation in Data Frames?

Thanks,

-Matt Cheah











SparkR package path

2015-09-21 Thread Hossein
Hi dev list,

SparkR backend assumes SparkR source files are located under
"SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
This setting makes sense for Spark developers, but if an R user downloads
and installs the SparkR source package, the source files are going to be
placed in different locations.

In the R runtime it is easy to find the location of package files using
path.package("SparkR"). But we need to make some changes to the R backend
and/or spark-submit so that the JVM process learns the location of worker.R,
daemon.R, and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The
current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
candidates and then the 1.5.1 release.

On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:

> I'd like to use some important bug fixes in 1.5 branch and I look for the
> apache maven host, but don't find any snapshot for 1.5 branch.
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>
> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for 1.5.X?
>


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
No, I mean pushing Spark to my private repository. Spark doesn't have a
build.sbt as far as I can see.

On Tue, Sep 22, 2015 at 1:29 PM, Fengdong Yu wrote:

> Do you mean you want to publish the artifact to your private repository?
>
> if so, please use ‘sbt publish’
>
> and add the following to your build.sbt:
>
> publishTo := {
>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>   if (version.value.endsWith("SNAPSHOT"))
>     Some("snapshots" at nexus + "content/repositories/snapshots")
>   else
>     Some("releases" at nexus + "content/repositories/releases")
> }
>
>
>
> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
>
> My project is using sbt (or maven), which need to download dependency from
> a maven repo. I have my own private maven repo with nexus but I don't know
> how to push my own build to it, can you give me a hint?
>
> On Tue, Sep 22, 2015 at 1:25 PM, Mark Hamstra wrote:
>
>> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
>> down on the job -- but there is nothing preventing you from checking out
>> branch-1.5 and creating your own build, which is arguably a smarter thing
>> to do anyway.  If I'm going to use a non-release build, then I want the
>> full git commit history of exactly what is in that build readily available,
>> not just somewhat arbitrary JARs.
>>
>> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  wrote:
>>
>>> But I cannot find 1.5.1-SNAPSHOT either at
>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>>>
>>> On Tue, Sep 22, 2015 at 12:55 PM, Mark Hamstra wrote:
>>>
 There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
 The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1
 release candidates and then the 1.5.1 release.

 On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  wrote:

> I'd like to use some important bug fixes in 1.5 branch and I look for
> the apache maven host, but don't find any snapshot for 1.5 branch.
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>
> I can find 1.4.X and 1.6.0 versions, why there is no snapshot for
> 1.5.X?
>


>>
>


Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers,

I just ran some very simple operations on a dataset. I was surprised by the
execution plan of take(1), head() or first().

For your reference, this is what I did in pyspark 1.5:
df=sqlContext.read.parquet("someparquetfiles")
df.head()

The above lines take over 15 minutes. I was frustrated because I can do
better without using spark :) Since I like spark, I tried to figure out
why. It seems the dataframe requires 3 stages to give me the first row. It
reads all the data (which is about 1 billion rows) and runs Limit twice.

Instead of head(), show(1) runs much faster. Not to mention that if I do:

df.rdd.take(1) //runs much faster.

Is this expected? Why is head/first/take so slow for a dataframe? Is it a bug
in the optimizer? or I did something wrong?

Best Regards,

Jerry


Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Hi, is there an existing way to blacklist any test suite?

Ideally we'd have a text file with a series of names (let's say comma 
separated), and if a name matches the fully qualified class name of a 
suite, that suite will be skipped.

Perhaps we can achieve this via ScalaTest or Maven?

Currently, if a number of suites are failing, we're required to comment 
these out, commit and push this change, then kick off a Jenkins job 
(perhaps building a custom branch) - not ideal when working with Jenkins. 
It would be quicker to use a mechanism such as the one described above, as 
opposed to having a few branches that are a little different from the others.

Also, how can we quickly run just one suite within, say, sql/hive? -f 
sql/hive/pom.xml with -nsu results in compile failures each time.


Re: Unsubscribe

2015-09-21 Thread Richard Hillegas

To unsubscribe from the dev list, please send a message to
dev-unsubscr...@spark.apache.org as described here:
http://spark.apache.org/community.html#mailing-lists.

Thanks,
-Rick

Dulaj Viduranga  wrote on 09/21/2015 10:15:58 AM:

> From: Dulaj Viduranga 
> To: dev@spark.apache.org
> Date: 09/21/2015 10:16 AM
> Subject: Unsubscribe
>
> Unsubscribe
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in
the ticket.

On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam  wrote:

> Hi Yin,
>
> You are right! I just tried the scala version with the above lines, it
> works as expected.
> I'm not sure if it happens also in 1.4 for pyspark but I thought the
> pyspark code just calls the scala code via py4j. I didn't expect that this
> bug is pyspark specific. That surprises me actually a bit. I created a
> ticket for this (SPARK-10731
> ).
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai  wrote:
>
>> btw, does 1.4 has the same problem?
>>
>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:
>>
>>> Hi Jerry,
>>>
>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:
>>>
 Hi Spark Developers,

 I just ran some very simple operations on a dataset. I was surprise by
 the execution plan of take(1), head() or first().

 For your reference, this is what I did in pyspark 1.5:
 df=sqlContext.read.parquet("someparquetfiles")
 df.head()

 The above lines take over 15 minutes. I was frustrated because I can do
 better without using spark :) Since I like spark, so I tried to figure out
 why. It seems the dataframe requires 3 stages to give me the first row. It
 reads all data (which is about 1 billion rows) and run Limit twice.

 Instead of head(), show(1) runs much faster. Not to mention that if I
 do:

 df.rdd.take(1) //runs much faster.

 Is this expected? Why head/first/take is so slow for dataframe? Is it a
 bug in the optimizer? or I did something wrong?

 Best Regards,

 Jerry

>>>
>>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Yin,

You are right! I just tried the scala version with the above lines, it
works as expected.
I'm not sure if it happens also in 1.4 for pyspark but I thought the
pyspark code just calls the scala code via py4j. I didn't expect that this
bug is pyspark specific. That surprises me actually a bit. I created a
ticket for this (SPARK-10731
).

Best Regards,

Jerry


On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai  wrote:

> btw, does 1.4 has the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprise by
>>> the execution plan of take(1), head() or first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>> df=sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can do
>>> better without using spark :) Since I like spark, so I tried to figure out
>>> why. It seems the dataframe requires 3 stages to give me the first row. It
>>> reads all data (which is about 1 billion rows) and run Limit twice.
>>>
>>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>>
>>> df.rdd.take(1) //runs much faster.
>>>
>>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>>> bug in the optimizer? or I did something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is df.rdd does not work very well with limit. In
scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this in the jira.
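For example (a sketch; the parquet path is a placeholder):

val df = sqlContext.read.parquet("/path/to/someparquetfiles")
df.limit(1).rdd.collect()   // triggers the slow plan described in this thread
df.rdd.take(1)              // returns quickly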

On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam  wrote:

> I just noticed you found 1.4 has the same issue. I added that as well in
> the ticket.
>
> On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam  wrote:
>
>> Hi Yin,
>>
>> You are right! I just tried the scala version with the above lines, it
>> works as expected.
>> I'm not sure if it happens also in 1.4 for pyspark but I thought the
>> pyspark code just calls the scala code via py4j. I didn't expect that this
>> bug is pyspark specific. That surprises me actually a bit. I created a
>> ticket for this (SPARK-10731
>> ).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai  wrote:
>>
>>> btw, does 1.4 has the same problem?
>>>
>>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:
>>>
 Hi Jerry,

 Looks like it is a Python-specific issue. Can you create a JIRA?

 Thanks,

 Yin

 On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam 
 wrote:

> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprise by
> the execution plan of take(1), head() or first().
>
> For your reference, this is what I did in pyspark 1.5:
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
>
> The above lines take over 15 minutes. I was frustrated because I can
> do better without using spark :) Since I like spark, so I tried to figure
> out why. It seems the dataframe requires 3 stages to give me the first 
> row.
> It reads all data (which is about 1 billion rows) and run Limit twice.
>
> Instead of head(), show(1) runs much faster. Not to mention that if I
> do:
>
> df.rdd.take(1) //runs much faster.
>
> Is this expected? Why head/first/take is so slow for dataframe? Is it
> a bug in the optimizer? or I did something wrong?
>
> Best Regards,
>
> Jerry
>


>>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue.

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:

> btw, does 1.4 has the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset. I was surprise by
>>> the execution plan of take(1), head() or first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>> df=sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can do
>>> better without using spark :) Since I like spark, so I tried to figure out
>>> why. It seems the dataframe requires 3 stages to give me the first row. It
>>> reads all data (which is about 1 billion rows) and run Limit twice.
>>>
>>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>>
>>> df.rdd.take(1) //runs much faster.
>>>
>>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>>> bug in the optimizer? or I did something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem?

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai  wrote:

> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:
>
>> Hi Spark Developers,
>>
>> I just ran some very simple operations on a dataset. I was surprise by
>> the execution plan of take(1), head() or first().
>>
>> For your reference, this is what I did in pyspark 1.5:
>> df=sqlContext.read.parquet("someparquetfiles")
>> df.head()
>>
>> The above lines take over 15 minutes. I was frustrated because I can do
>> better without using spark :) Since I like spark, so I tried to figure out
>> why. It seems the dataframe requires 3 stages to give me the first row. It
>> reads all data (which is about 1 billion rows) and run Limit twice.
>>
>> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>>
>> df.rdd.take(1) //runs much faster.
>>
>> Is this expected? Why head/first/take is so slow for dataframe? Is it a
>> bug in the optimizer? or I did something wrong?
>>
>> Best Regards,
>>
>> Jerry
>>
>
>


Unsubscribe

2015-09-21 Thread Dulaj Viduranga
Unsubscribe

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry,

Looks like it is a Python-specific issue. Can you create a JIRA?

Thanks,

Yin

On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprise by the
> execution plan of take(1), head() or first().
>
> For your reference, this is what I did in pyspark 1.5:
> df=sqlContext.read.parquet("someparquetfiles")
> df.head()
>
> The above lines take over 15 minutes. I was frustrated because I can do
> better without using spark :) Since I like spark, so I tried to figure out
> why. It seems the dataframe requires 3 stages to give me the first row. It
> reads all data (which is about 1 billion rows) and run Limit twice.
>
> Instead of head(), show(1) runs much faster. Not to mention that if I do:
>
> df.rdd.take(1) //runs much faster.
>
> Is this expected? Why head/first/take is so slow for dataframe? Is it a
> bug in the optimizer? or I did something wrong?
>
> Best Regards,
>
> Jerry
>


Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-21 Thread shane knapp
quick update:  we actually did some of the maintenance on our systems
after the berkeley-wide outage caused by one of our (non-jenkins)
servers halting and catching fire.

we'll still have some downtime early wednesday, but tomorrow's will be
cancelled.  i'll send out another update real soon now with what we'll
be covering on wednesday once we get our current situation more under
control.  :)

On Wed, Sep 16, 2015 at 12:15 PM, shane knapp  wrote:
>> 630am-10am thursday, 9-24-15:
>> * jenkins update to 1.629 (we're a few months behind in versions, and
>> some big bugs have been fixed)
>> * jenkins master and worker system package updates
>> * all systems get a reboot (lots of hanging java processes have been
>> building up over the months)
>> * builds will stop being accepted ~630am, and i'll kill any hangers-on
>> at 730am, and retrigger once we're done
>> * expected downtime:  3.5 hours or so
>> * i will also be testing out some of my shiny new ansible playbooks
>> for the system updates!
>>
> i forgot one thing:
>
> * moving default system java for builds from jdk1.7.0_71 to jdk1.7.0_79




Re: how to send additional configuration to the RDD after it was lazily created

2015-09-21 Thread Romi Kuntsman
What new information do you know after creating the RDD, that you didn't
know at the time of its creation?
I think the whole point is that RDD is immutable, you can't change it once
it was created.
Perhaps you need to refactor your logic to know the parameters earlier, or
create a whole new RDD again.
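As a sketch of the "know the parameters earlier" approach: build the Hadoop
Configuration only after the extra information is available, then create the
RDD from it (the path, key, and input format below are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("test", "111")   // the late-arriving setting

val rdd = sc.newAPIHadoopFile(
  "/path/to/input",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  hadoopConf)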

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Sep 17, 2015 at 10:07 AM, Gil Vernik  wrote:

> Hi,
>
> I have the following case, which i am not sure how to resolve.
>
> My code uses HadoopRDD and creates various RDDs on top of it
> (MapPartitionsRDD, and so on )
> After all RDDs were lazily created, my code "knows" some new information
> and i want that "compute" method of the HadoopRDD will be aware of it (at
> the point when "compute" method will be called).
> What is the possible way 'to send' some additional information to the
> compute method of the HadoopRDD after this RDD is lazily created?
> I tried to play with configuration, like to perform set("test","111") in
> the code and modify the compute method of HadoopRDD with get("test") - but
> it's not working, since SparkContext has only a clone of the
> configuration and it can't be modified at run time.
>
> Any thoughts how can i make it?
>
> Thanks
> Gil.


Re: How to modify Hadoop APIs used by Spark?

2015-09-21 Thread Ted Yu
Can you clarify what you want to do:
If you modify existing hadoop InputFormat, etc, it would be a matter of
rebuilding hadoop and build Spark using the custom built hadoop as
dependency.

Do you introduce new InputFormat ?

Cheers

On Mon, Sep 21, 2015 at 1:20 PM, Dogtail Ray  wrote:

> Hi all,
>
> I find that Spark uses some Hadoop APIs such as InputFormat, InputSplit,
> etc., and I want to modify these Hadoop APIs. Do you know how can I
> integrate my modified Hadoop code into Spark? Great thanks!
>
>


Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list

Hi Dirceu,

The answer to whether throwing an exception is better or null is better
depends on your use case. If you are debugging and want to find bugs with
your program, you might prefer throwing an exception. However, if you are
running on a large real-world dataset (i.e. data is dirty) and your query
can take a while (e.g. 30 mins), you then might prefer the system to just
assign null values to the dirty data that could lead to runtime exceptions,
because otherwise you could be spending days just to clean your data.

Postgres throws exceptions here, but I think that's mainly because it is
used for OLTP, and in those cases queries are short-running. Most other
analytic databases I believe just return null. The best we can do is to
provide a config option to indicate behavior for exception handling.
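As an illustration of the precision/scale rule discussed further down in this
thread (a sketch, using the same types as the reported test):

import org.apache.spark.sql.types.DecimalType

val tight = DecimalType(10, 10)   // precision - scale = 0 integer digits: casting 10.5 yields null
val wide  = DecimalType(12, 10)   // precision - scale = 2 integer digits: 10.5 fits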


On Fri, Sep 18, 2015 at 8:15 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hi Yin, I got that part.
> I just think that instead of returning null, throwing an exception would
> be better. In the exception message we can explain that the DecimalType
> used can't fit the number that is been converted due to the precision and
> scale values used to create it.
> It would be easier for the user to find the reason why that error is
> happening, instead of receiving an NullPointerException in another part of
> his code. We can also make a better documentation of DecimalType classes to
> explain this behavior, what do you think?
>
>
>
>
> 2015-09-17 18:52 GMT-03:00 Yin Huai :
>
>> As I mentioned before, the range of values of DecimalType(10, 10) is [0,
>> 1). If you have a value 10.5 and you want to cast it to DecimalType(10,
>> 10), I do not think there is any better returned value except of null.
>> Looks like DecimalType(10, 10) is not the right type for your use case. You
>> need a decimal type that has precision - scale >= 2.
>>
>> On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>>>
>>> Hi Yin, posted here because I think it's a bug.
>>> So, it will return null and I can get a nullpointerexception, as I was
>>> getting. Is this really the expected behavior? Never seen something
>>> returning null in other Scala tools that I used.
>>>
>>> Regards,
>>>
>>>
>>> 2015-09-14 18:54 GMT-03:00 Yin Huai :
>>>
 btw, move it to user list.

 On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai  wrote:

> A scale of 10 means that there are 10 digits at the right of the
> decimal point. If you also have precision 10, the range of your data will
> be [0, 1) and casting "10.5" to DecimalType(10, 10) will return null, 
> which
> is expected.
>
> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi all,
>> I'm moving from spark 1.4 to 1.5, and one of my tests is failing.
>> It seems that there was some changes in org.apache.spark.sql.types.
>> DecimalType
>>
>> This ugly code is a little sample to reproduce the error, don't use
>> it into your project.
>>
>> test("spark test") {
>>   val file = 
>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>> Row.fromSeq({
>> val values = f.split(",")
>> 
>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>> values.tail.tail.tail.head)}))
>>
>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>> StructField("int2", IntegerType, false), StructField("double",
>>
>>  DecimalType(10,10), false),
>>
>>
>> StructField("str2", StringType, false)))
>>
>>   val df = context.sqlContext.createDataFrame(file,structType)
>>   df.first
>> }
>>
>> The content of the file is:
>>
>> 1,5,10.5,va
>> 2,1,0.1,vb
>> 3,8,10.0,vc
>>
>> The problem resides in DecimalType, before 1.5 the scala wasn't
>> required. Now when using  DecimalType(12,10) it works fine, but
>> using DecimalType(10,10) the Decimal values
>> 10.5 became null, and the 0.1 works.
>>
>> Is there anybody working with DecimalType for 1.5.1?
>>
>> Regards,
>> Dirceu
>>
>>
>

>>>
>>>
>>
>


How to modify Hadoop APIs used by Spark?

2015-09-21 Thread Dogtail Ray
Hi all,

I find that Spark uses some Hadoop APIs such as InputFormat, InputSplit,
etc., and I want to modify these Hadoop APIs. Do you know how can I
integrate my modified Hadoop code into Spark? Great thanks!


Re: Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Thanks Josh, I should have added that we've tried with -DwildcardSuites 
and Maven, and we use this helpful feature regularly (although this does 
result in building plenty of tests and running other tests in other 
modules too), so I'm wondering if there's a more "streamlined" way - e.g. with 
JUnit and Eclipse we'd just right-click one individual unit test and 
that would be run, without building again AFAIK.

Unfortunately using sbt causes a lot of pain, such as...

[error]
[error]   last tree to typer: 
Literal(Constant(org.apache.spark.sql.test.ExamplePoint))
[error]   symbol: null
[error]symbol definition: null
[error]  tpe: 
Class(classOf[org.apache.spark.sql.test.ExamplePoint])
[error]symbol owners:
[error]   context owners: class ExamplePointUDT -> package test
[error]

and then an awfully long stacktrace with plenty of errors. Must be an 
easier way...



From:   Josh Rosen 
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev 
Date:   21/09/2015 19:19
Subject:Re: Test workflow - blacklist entire suites and run any 
independently



For quickly running individual suites: 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests

On Mon, Sep 21, 2015 at 8:21 AM, Adam Roberts  wrote:
Hi, is there an existing way to blacklist any test suite? 

Ideally we'd have a text file with a series of names (let's say comma 
separated) and if a name matches with the fully qualified class name for a 
suite, this suite will be skipped. 

Perhaps we can achieve this via ScalaTest or Maven? 

Currently if a number of suites are failing we're required to comment 
these out, commit and push this change then kick off a Jenkins job 
(perhaps building a custom branch) - not ideal when working with Jenkins, 
would be quicker to use such a mechanism as described above as opposed to 
having a few branches that are a little different from others. 

Also, how can we quickly only run any one suite within, say, sql/hive? -f 
sql/hive/pom.xml with -nsu results in compile failures each time.