Re: Inconsistency in nullValue handling for CSV: see SPARK-16462, SPARK-16460, SPARK-15144, SPARK-17290 and SPARK-16903

2016-08-29 Thread Nicholas Chammas
I wish JIRA would automatically show you potentially similar issues as you
are typing up a new one, like Stack Overflow does...

It would really help cut down on duplicate reports.

On Mon, Aug 29, 2016 at 10:55 PM Hyukjin Kwon  wrote:

> Hi all,
>
>
> PR:
> https://github.com/apache/spark/pull/14118
>
> JIRAs
> https://issues.apache.org/jira/browse/SPARK-17290
> https://issues.apache.org/jira/browse/SPARK-16903
> https://issues.apache.org/jira/browse/SPARK-16462
> https://issues.apache.org/jira/browse/SPARK-16460
> https://issues.apache.org/jira/browse/SPARK-15144
>
> It seems this is not critical, but duplicate issues keep being opened
> because of it.
>
> Also, it seems many users are affected by this. I hope this email draws
> some interest from reviewers.
>
>
> Thanks.
>


Inconsistency in nullValue handling for CSV: see SPARK-16462, SPARK-16460, SPARK-15144, SPARK-17290 and SPARK-16903

2016-08-29 Thread Hyukjin Kwon
Hi all,


PR:
https://github.com/apache/spark/pull/14118

JIRAs
https://issues.apache.org/jira/browse/SPARK-17290
https://issues.apache.org/jira/browse/SPARK-16903
https://issues.apache.org/jira/browse/SPARK-16462
https://issues.apache.org/jira/browse/SPARK-16460
https://issues.apache.org/jira/browse/SPARK-15144

It seems this is not critical, but duplicate issues keep being opened
because of it.

Also, it seems many users are affected by this. I hope this email draws some
interest from reviewers.
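
For anyone who has not opened the JIRAs, here is a minimal sketch of the kind
of CSV read/write the nullValue option is involved in (the paths and option
values shown are hypothetical, not taken from the reports); the issues and the
PR above are about making the handling of that token consistent:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-nullvalue-sketch").getOrCreate()

// Read a CSV file where the token "-" is supposed to come back as SQL NULL.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "-")
  .csv("/path/to/input.csv")

// Write it back out using the same token for nulls.
df.write
  .option("nullValue", "-")
  .csv("/path/to/output")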


Thanks.


Re: Real time streaming in Spark

2016-08-29 Thread Luciano Resende
There were some prototypes/discussions on top of Spark Streaming, including
how that would fit with Structured Streaming, which was in the design phase
at that time. See https://issues.apache.org/jira/browse/SPARK-14745 for some
details and a link to the PR.

On Mon, Aug 29, 2016 at 1:13 PM, Tomasz Gawęda 
wrote:

> Hi everyone,
>
>
> I wonder if there are plans to implement real-time streaming in Spark. I
> see that in Spark 2.0, Trigger can have more implementations than
> ProcessingTime.
>
>
> In my opinion, real-time streaming (i.e., reacting to every event, like
> continuous queries in Apache Ignite) would be very useful and would fill a
> gap that currently exists in Spark. Right now, if we must implement both
> real-time streaming and batch jobs, the streaming must be done in other
> frameworks, as Spark only allows us to process events in micro-batches.
> Matei Zaharia wrote in the Databricks blog about Continuous Applications
> [1]; in my opinion, adding an EventTrigger would be the next big step
> toward Continuous Applications.
>
>
> What do you think about it? Are there any plans to implement such an
> event-based trigger? Of course I can help with the implementation; however,
> I'm just starting to learn Spark internals, and it will take a while before
> I am able to write something.
>
>
> Pozdrawiam / Best regards,
>
> Tomek
>
>
> [1] https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Saving less data to improve Pregel performance in GraphX?

2016-08-29 Thread Fang Zhang
Dear developers,

I am running some tests using the Pregel API.

It seems to me that more than 90% of the volume of a graph object is
composed of index structures that do not change during the execution of
Pregel. When a graph is too large to fit in memory, Pregel will persist the
intermediate graph to disk in each iteration, which seems to involve a lot
of repeated disk writes.

In my test (shortest path), I save only one copy of the initial graph and
maintain only a var of RDD[(VertexId, VD)]. To create new messages, I build
a new graph in each iteration from the updated RDD[(VertexId, VD)] and the
fixed data in the initial graph. Using a slow NTFS hard drive, I observed
around a 40% overall improvement. Note that my updateVertices (corresponding
to joinVertices) and edges.upgrade are not optimized yet (they can be
optimized following the approach used in GraphX), so the improvement should
come from I/O.
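
A rough sketch of the idea, using single-source shortest paths (this is a
simplification, not the actual test code; names and the loop structure are
illustrative):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

def sssp(initialGraph: Graph[Double, Double],
         sourceId: VertexId,
         maxIterations: Int): RDD[(VertexId, Double)] = {
  // The only state carried between iterations: the current best distance per vertex.
  var distances: RDD[(VertexId, Double)] =
    initialGraph.vertices.mapValues { (id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity
    }
  var i = 0
  while (i < maxIterations) {
    // Re-attach the current distances to the fixed graph structure for this round.
    val current = initialGraph.outerJoinVertices(distances) {
      (_, _, dOpt) => dOpt.getOrElse(Double.PositiveInfinity)
    }
    // Candidate distances sent along edges, reduced per destination vertex.
    val messages: RDD[(VertexId, Double)] = current.triplets
      .filter(t => t.srcAttr + t.attr < t.dstAttr)
      .map(t => (t.dstId, t.srcAttr + t.attr))
      .reduceByKey((a, b) => math.min(a, b))
    // Fold the messages back into the distance RDD; only this RDD (not the
    // whole graph) needs to be persisted or checkpointed between iterations.
    distances = current.vertices
      .leftOuterJoin(messages)
      .map { case (id, (d, mOpt)) => (id, math.min(d, mOpt.getOrElse(d))) }
    i += 1
  }
  distances
}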

So my question is: do you think the current flow of Pregel could be improved
by persisting only a small portion of a large Graph object? If there are
other concerns, could you explain them?

Best regards,
Fang



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Saving-less-data-to-improve-Pregel-performance-in-GraphX-tp18762.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



KMeans calls takeSample() twice?

2016-08-29 Thread gsamaras
After reading the relevant internal Spark code, I wasn't able to
understand why KMeans calls takeSample() twice. Can someone please explain?

There is a relevant StackOverflow question.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/KMeans-calls-takeSample-twice-tp18761.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Real time streaming in Spark

2016-08-29 Thread Tomasz Gawęda
Hi everyone,


I wonder if there are plans to implement real-time streaming in Spark. I see
that in Spark 2.0, Trigger can have more implementations than ProcessingTime.


In my opinion, real-time streaming (i.e., reacting to every event, like
continuous queries in Apache Ignite) would be very useful and would fill a
gap that currently exists in Spark. Right now, if we must implement both
real-time streaming and batch jobs, the streaming must be done in other
frameworks, as Spark only allows us to process events in micro-batches.
Matei Zaharia wrote in the Databricks blog about Continuous Applications [1];
in my opinion, adding an EventTrigger would be the next big step toward
Continuous Applications.
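
For reference, the only trigger shipped with Spark 2.0's Structured Streaming
is time-based; a rough sketch of how it is used is below (the source, paths,
and output format are made up). The EventTrigger mentioned above would be a
hypothetical additional Trigger implementation; nothing like it exists in 2.0.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

val spark = SparkSession.builder().appName("trigger-sketch").getOrCreate()

// A toy streaming source; any streaming DataFrame would do here.
val lines = spark.readStream.text("/tmp/streaming-input")

val query = lines.writeStream
  .format("parquet")
  .option("path", "/tmp/streaming-output")
  .option("checkpointLocation", "/tmp/streaming-checkpoints")
  .trigger(ProcessingTime("10 seconds")) // fire a micro-batch every 10 seconds
  .start()

query.awaitTermination()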


What do you think about it? Are there any plans to implement such an
event-based trigger? Of course I can help with the implementation; however,
I'm just starting to learn Spark internals, and it will take a while before
I am able to write something.


Pozdrawiam / Best regards,

Tomek


[1] https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html





Re: Performance of loading parquet files into case classes in Spark

2016-08-29 Thread Julien Dumazert
Hi Maciek,

I followed your recommendation and benchmarked DataFrame aggregations on a
Dataset. Here is what I got:

// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)
// Dataset with map and DataFrame sum
// 35.372s
df.as[A].map(_.fieldToSum).agg(sum("value")).collect().head.getAs[Long](0)

Not much of a difference. It seems that as soon as you access data the way
you would with RDDs, you force the full decoding of the object into a case
class, which is super costly.

I find this behavior quite normal: as soon as you provide the user with the
ability to pass a black-box function, anything can happen, so you have to load
the whole object. On the other hand, when using SQL-style functions only,
everything is "white box", so Spark understands what you want to do and can
optimize.

Still, to me it breaks the promise of Datasets, and I hope there is something
that can be done here (I'm not confident on this point) and that it will be
addressed in a later release.
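
One possibly relevant direction, with no claim that it avoids the decoding
cost (it would need the same kind of benchmark): Spark 2.0's typed Aggregator
keeps the code type-safe while staying inside the DataFrame/Catalyst
machinery. A and fieldToSum below are the names from the benchmarks above.

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

case class A(fieldToSum: Long)

// A typed aggregator that sums fieldToSum over a Dataset[A].
object SumFieldToSum extends Aggregator[A, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, a: A): Long = acc + a.fieldToSum
  def merge(acc1: Long, acc2: Long): Long = acc1 + acc2
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Usage on the same Dataset as above:
// val total = df.as[A].select(SumFieldToSum.toColumn).collect().head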

Best regards,
Julien


> On 28 Aug 2016, at 22:12, Maciej Bryński wrote:
> 
> Hi Julien,
> I thought about something like this:
> import org.apache.spark.sql.functions.sum
> df.as[A].map(_.fieldToSum).agg(sum("value")).collect()
> To try using DataFrame aggregation on a Dataset instead of reduce.
> 
> Regards,
> Maciek
> 
> 2016-08-28 21:27 GMT+02:00 Julien Dumazert:
> Hi Maciek,
> 
> I've tested several variants for summing "fieldToSum":
> 
> First, RDD-style code:
> df.as[A].map(_.fieldToSum).reduce(_ + _)
> df.as[A].rdd.map(_.fieldToSum).sum()
> df.as[A].map(_.fieldToSum).rdd.sum()
> All around 30 seconds. "reduce" and "sum" seem to have the same performance, 
> for this use case at least.
> 
> Then with sql.functions.sum:
> df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)
> 0.24 seconds, super fast.
> 
> Finally, dataset with column selection:
> df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)
> 0.18 seconds, super fast again.
> 
> (I've also tried replacing my sums and reduces by counts on your advice, but 
> the performance is unchanged. Apparently, summing does not take much time 
> compared to accessing data.)
> 
> It seems that we need to use the SQL interface to reach the highest level
> of performance, which somehow breaks the promise of Dataset (preserving
> type safety while having Catalyst and Tungsten performance like DataFrames).
> 
> As for direct access to Row, it seems that it got much slower from 1.6 to
> 2.0. I guess that's because DataFrame is now Dataset[Row], and thus uses
> the same encoding/decoding mechanism as any other case class.
> 
> Best regards,
> 
> Julien
> 
>> On 27 Aug 2016, at 22:32, Maciej Bryński wrote:
>> 
>> 
>> 2016-08-27 15:27 GMT+02:00 Julien Dumazert:
>> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)
>> 
>> I think reduce and sum have very different performance.
>> Did you try sql.functions.sum?
>> Or if you want to benchmark access to the Row object, then the count()
>> function would be a better idea.
>> 
>> Regards,
>> -- 
>> Maciek Bryński
> 
> 
> 
> 
> -- 
> Maciek Bryński



Re: Structured Streaming with Kafka sources/sinks

2016-08-29 Thread Fred Reiss
I think that the community really needs some feedback on the progress of
this very important task. Many existing Spark Streaming applications can't
be ported to Structured Streaming without Kafka support.

Is there a design document somewhere? Or can someone from the Databricks
team break down the existing monolithic JIRA issue into smaller steps that
reflect the current development plan?

Fred


On Sat, Aug 27, 2016 at 2:32 PM, Koert Kuipers  wrote:

> that's great
>
> is this effort happening anywhere that is publicly visible? github?
>
> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin  wrote:
>
>> We (the team at Databricks) are working on one currently.
>>
>>
>> On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger 
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/SPARK-15406
>>>
>>> I'm not working on it (yet?), never got an answer to the question of
>>> who was planning to work on it.
>>>
>>> On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao 
>>> wrote:
>>> > Hi all,
>>> >
>>> >
>>> >
>>> > I’m trying to write Structured Streaming test code and will deal with
>>> Kafka
>>> > source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
>>> >
>>> >
>>> >
>>> > I found some Databricks slides saying that Kafka sources/sinks will be
>>> > implemented in Spark 2.0, so is there anybody working on this? And
>>> when will
>>> > it be released?
>>> >
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Chenzhao Guo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: spark roadmap

2016-08-29 Thread Mark Hamstra
At this point, there is no target date set for 2.1.  That's something that
we should do fairly soon, but right now there is at least a little room for
discussion as to whether we want to continue with the same pace of releases
that we targeted throughout the 1.x development cycles, or whether
lengthening the release cycle by a month or two might be better (mainly by
reducing the overhead fraction that comes from the constant-size
engineering mechanics of coordinating and making a release).

On Mon, Aug 29, 2016 at 1:23 AM, Denis Bolshakov 
wrote:

> Hello spark devs,
>
> Can anyone provide a roadmap for the next two months?
> Or at least say when we can expect the 2.1 release and which features will
> be added?
>
>
> --
> //with Best Regards
> --Denis Bolshakov
> e-mail: bolshakov.de...@gmail.com
>


[build system] jenkins wedged itself this weekend, just restarted

2016-08-29 Thread shane knapp
jenkins got into one of its "states" and wasn't accepting new builds
starting this past saturday night. i restarted it, and now it's
catching up on the weekend's queue.

shane

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Broadcast Variable Life Cycle

2016-08-29 Thread Sean Owen
Yes, you want to actively unpersist() or destroy() broadcast variables
when they're no longer needed. They can eventually be removed when the
reference on the driver is garbage collected, but you usually would
not want to rely on that.
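
A sketch of what that could look like in the loop from the question below
(the per-iteration output path is an illustrative tweak so the toy example
can run repeatedly, and the usual SparkSession implicits are assumed to be
in scope for toDF):

for (i <- 1 to 10) {
  val bc = sc.broadcast(i)
  sc.parallelize(Seq(1, 2, 3))
    .map { id => val v = bc.value; (id, v) }
    .toDF("id", "i")
    .write.parquet(s"/dummy_output_$i")

  // The job above has completed, so the broadcast blocks can be removed from
  // the executors now; destroy() additionally releases the copy on the driver
  // and makes the variable unusable afterwards.
  bc.unpersist(blocking = true)
  // or: bc.destroy()
}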

On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam  wrote:
> Hello spark developers,
>
> Can anyone shed some light on the life cycle of broadcast variables?
> Basically, I have a broadcast variable defined in a loop, and for each
> iteration I provide a different value.
> // For example:
> for (i <- 1 to 10) {
>   val bc = sc.broadcast(i)
>   sc.parallelize(Seq(1, 2, 3)).map { id => val v = bc.value; (id, v) }
>     .toDF("id", "i").write.parquet("/dummy_output")
> }
>
> Do I need to actively manage the broadcast variable in this case? I know
> this example is not real, but please imagine this broadcast variable can
> hold an array of 1M Longs.
>
> Regards,
>
> Jerry
>
>
>
> On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam  wrote:
>>
>> Hello spark developers,
>>
>> Can someone explain to me what the lifecycle of a broadcast variable is?
>> When will a broadcast variable be garbage-collected on the driver side and
>> on the executor side? Does a Spark application need to actively manage
>> broadcast variables to ensure that it will not run into OOM?
>>
>> Best Regards,
>>
>> Jerry
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Remaining folders in .sparkStaging directory after app was killed

2016-08-29 Thread Artur Sukhenko
Hello spark devs,

Whenever I run a Spark app in yarn-cluster mode, press Ctrl+C to stop
spark-submit, and then run yarn application -kill,
I am left with folders in HDFS:
.sparkStaging/application_1472140614688_0001
.sparkStaging/application_1472140614688_0002

Will those folders never be deleted?
And if so, is the only way to clean them up to remove them from HDFS manually?


---
Sincerely,
Artur Sukhenko


Re: Broadcast Variable Life Cycle

2016-08-29 Thread Jerry Lam
Hello spark developers,

Can anyone shed some light on the life cycle of broadcast variables?
Basically, I have a broadcast variable defined in a loop, and for each
iteration I provide a different value.
// For example:
for (i <- 1 to 10) {
  val bc = sc.broadcast(i)
  sc.parallelize(Seq(1, 2, 3)).map { id => val v = bc.value; (id, v) }
    .toDF("id", "i").write.parquet("/dummy_output")
}

Do I need to actively manage the broadcast variable in this case? I know
this example is not real, but please imagine this broadcast variable can
hold an array of 1M Longs.

Regards,

Jerry



On Sun, Aug 21, 2016 at 1:07 PM, Jerry Lam  wrote:

> Hello spark developers,
>
> Can someone explain to me what the lifecycle of a broadcast variable is?
> When will a broadcast variable be garbage-collected on the driver side and
> on the executor side? Does a Spark application need to actively manage
> broadcast variables to ensure that it will not run into OOM?
>
> Best Regards,
>
> Jerry
>


Re: Spark 2.0 and Yarn

2016-08-29 Thread Saisai Shao
This archive contains all the jars required by the Spark runtime. You could
zip all the jars under /jars, upload the archive to HDFS, and then
configure spark.yarn.archive with the path of that archive on HDFS.

On Sun, Aug 28, 2016 at 9:59 PM, Srikanth Sampath  wrote:

> Hi,
> With SPARK-11157, the big fat assembly jar build was removed.
>
> Has anyone used spark.yarn.archive (the alternative provided) and
> successfully deployed Spark on a YARN cluster? If so, what does the archive
> contain? What should the minimal set be? Any suggestion is greatly
> appreciated.
>
> Thanks,
> -Srikanth
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-2-0-and-Yarn-tp18748.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


spark roadmap

2016-08-29 Thread Denis Bolshakov
Hello spark devs,

Can anyone provide a roadmap for the next two months?
Or at least say when we can expect the 2.1 release and which features will
be added?


-- 
//with Best Regards
--Denis Bolshakov
e-mail: bolshakov.de...@gmail.com