Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Sean Owen
Sure, it's only an issue insofar as it may be a flaky test. If it's fixable
or disable-able for a possible next RC, that could be helpful.
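
(For reference, a minimal sketch of one way to quarantine such a test while
it is investigated, assuming the suite uses ScalaTest's FunSuite-style
registration; the suite name below is hypothetical, only the test name comes
from the failure output:)

```
import org.scalatest.FunSuite

// Hypothetical suite: renaming `test(...)` to `ignore(...)` keeps the
// case compiled but skips it at run time — ScalaTest's standard way to
// disable a flaky test temporarily.
class KafkaStressSuite extends FunSuite {
  ignore("stress test for failOnDataLoss=false") {
    // ... original stress-test body would go here ...
  }
}
```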

On Sat, Dec 10, 2016 at 2:09 AM Shixiong(Ryan) Zhu wrote:

> Sean, "stress test for failOnDataLoss=false" is because Kafka consumer
> may be thrown NPE when a topic is deleted. I added some logic to retry on
> such failure, however, it may still fail when topic deletion is too
> frequent (the stress test). Just reopened
> https://issues.apache.org/jira/browse/SPARK-18588.
>
> Anyway, this is just a best effort to deal with Kafka issue, and in
> practice, people won't delete topic frequently, so this is not a release
> blocker.
>


Document Similarity - Spark MLlib

2016-12-09 Thread satyajit vegesna
Hi all,

I am trying to implement a Spark MLlib job to find the similarity between
documents (which, in my case, are basically home addresses).

I believe I cannot use DIMSUM for my use case, as DIMSUM works well only on
matrices with thin columns and many rows.

Matrix example format for my use case:

                 doc1 (address1)   doc2 (address2)   ...
  san mateo      0.73462           0
  san francisco  ...               ...
  san bruno      ...               ...
  ...

(m, the number of document columns, is going to be huge as I have more
addresses, and n, the number of rows, is going to be thin compared to m.)

I would like to know if there is a way to leverage DIMSUM for my use case
and, if not, what other algorithm available in Spark MLlib I could try.
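
(For reference, a minimal sketch of MLlib's DIMSUM entry point,
RowMatrix.columnSimilarities(), on made-up toy data — the vectors and
threshold are illustrative. It computes cosine similarities between the
columns of a tall-and-skinny matrix, which is exactly the constraint
described above:)

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows are terms, columns are documents; DIMSUM's sampling bounds
// assume the column count is small relative to the row count.
val rows = sc.parallelize(Seq(
  Vectors.dense(0.73462, 0.0),  // e.g. weight of "san mateo" in doc1, doc2
  Vectors.dense(0.21, 0.95)
))
val mat = new RowMatrix(rows)
// A positive threshold switches on the approximate, sampling-based
// DIMSUM variant; pairs below the threshold may be dropped.
val sims = mat.columnSimilarities(0.1)
sims.entries.collect().foreach(println)
```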

Regards,
Satyajit.


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Cody Koeninger
Agreed that frequent topic deletion is not a very Kafka-esque thing to do.

On Fri, Dec 9, 2016 at 12:09 PM, Shixiong(Ryan) Zhu wrote:
> Sean, "stress test for failOnDataLoss=false" is because Kafka consumer may
> be thrown NPE when a topic is deleted. I added some logic to retry on such
> failure, however, it may still fail when topic deletion is too frequent (the
> stress test). Just reopened
> https://issues.apache.org/jira/browse/SPARK-18588.
>
> Anyway, this is just a best effort to deal with Kafka issue, and in
> practice, people won't delete topic frequently, so this is not a release
> blocker.
>
> On Fri, Dec 9, 2016 at 2:55 AM, Sean Owen  wrote:
>>
>> As usual, the sigs / hashes are fine and licenses look fine.
>>
>> I am still seeing some test failures. A few I've seen over time and aren't
>> repeatable, but a few seem persistent. Anyone else observed these? I'm on
>> Ubuntu 16 / Java 8, building with -Pyarn -Phadoop-2.7 -Phive.
>>
>> If anyone can confirm, I'll investigate the cause if I can. I'd hesitate to
>> support the release yet unless the build is definitely passing for others.
>>
>>
>> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.281 sec <<< ERROR!
>> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
>> at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
>>
>>
>>
>> - caching on disk *** FAILED ***
>>   java.util.concurrent.TimeoutException: Can't find 2 executors before 3 milliseconds elapsed
>>   at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
>>   at org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
>>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
>>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>>   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>>   ...
>>
>>
>> - stress test for failOnDataLoss=false *** FAILED ***
>>   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 3b191b78-7f30-46d3-93f8-5fbeecce94a2, runId = 0cab93b6-19d8-47a7-88ad-d296bea72405] terminated with exception: null
>>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:262)
>>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:160)
>>   ...
>>   Cause: java.lang.NullPointerException:
>>   ...
>>
>>
>>
>> On Thu, Dec 8, 2016 at 4:40 PM Reynold Xin  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.0-rc2
>>> (080717497365b83bc202ab16812ced93eb1ea7bd)
>>>
>>> List of JIRA tickets resolved are:
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1217
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>>>
>>>
>>> (Note that the docs and staging repo are still being uploaded and will be
>>> available soon)
>>>
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.1.0?
>>> ===
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Shixiong(Ryan) Zhu
Sean, "stress test for failOnDataLoss=false" is because Kafka consumer may
be thrown NPE when a topic is deleted. I added some logic to retry on such
failure, however, it may still fail when topic deletion is too frequent
(the stress test). Just reopened
https://issues.apache.org/jira/browse/SPARK-18588.

Anyway, this is just a best effort to deal with Kafka issue, and in
practice, people won't delete topic frequently, so this is not a release
blocker.
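
(A minimal sketch of the retry idea just described — illustrative only, not
the actual code in the Kafka source; the helper name and backoff are made
up:)

```
// Retry an operation that may throw an NPE while a topic is being
// deleted, giving up after maxRetries further attempts.
def withRetry[T](maxRetries: Int)(body: => T): T =
  try body catch {
    case _: NullPointerException if maxRetries > 0 =>
      Thread.sleep(100) // brief pause before the next attempt
      withRetry(maxRetries - 1)(body)
  }

// Usage: wrap the consumer call that can hit the NPE, e.g.
// val records = withRetry(3) { consumer.poll(pollTimeoutMs) }
```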

On Fri, Dec 9, 2016 at 2:55 AM, Sean Owen  wrote:

> As usual, the sigs / hashes are fine and licenses look fine.
>
> I am still seeing some test failures. A few I've seen over time and aren't
> repeatable, but a few seem persistent. Anyone else observed these? I'm on
> Ubuntu 16 / Java 8, building with -Pyarn -Phadoop-2.7 -Phive.
>
> If anyone can confirm, I'll investigate the cause if I can. I'd hesitate to
> support the release yet unless the build is definitely passing for others.
>
>
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.281 sec <<< ERROR!
> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
>
>
>
> - caching on disk *** FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 2 executors before 3 milliseconds elapsed
>   at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
>   at org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   ...
>
>
> - stress test for failOnDataLoss=false *** FAILED ***
>   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 3b191b78-7f30-46d3-93f8-5fbeecce94a2, runId = 0cab93b6-19d8-47a7-88ad-d296bea72405] terminated with exception: null
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:262)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:160)
>   ...
>   Cause: java.lang.NullPointerException:
>   ...
>
>
>
> On Thu, Dec 8, 2016 at 4:40 PM Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.0-rc2 (080717497365b83bc202ab16812ced93eb1ea7bd)
>>
>> List of JIRA tickets resolved are:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1217
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>>
>>
>> (Note that the docs and staging repo are still being uploaded and will be
>> available soon)
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.1.0?
>> ===
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>
>


Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-09 Thread Dongjoon Hyun
Thank you for the opinion, Dongjin!


On Thu, Dec 8, 2016 at 21:56 Dongjin Lee  wrote:

> +1 for this idea. I also need it.
>
> Regards,
> Dongjin
>
> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Could you give me some opinions?
>
> There is an old Spark issue, SPARK-11374, about removing header lines from
> a text file.
>
> Currently, Spark supports removing CSV header lines in the following way:
> ```
> scala> spark.read.option("header","true").csv("/data").show
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  a|
> |  2|  b|
> +---+---+
> ```
>
> In the SQL world, we could support this in the Hive way, via
> `skip.header.line.count`:
> ```
> scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
> TBLPROPERTIES('skip.header.line.count'='1')")
>
> scala> sql("SELECT * FROM t1").show
> +---+-+
> | id|value|
> +---+-+
> |  1|a|
> |  2|b|
> +---+-+
> ```
>
> Although I made a PR for this based on the JIRA issue, I want to know
> whether this is a really needed feature.
>
> Is it needed for your use cases? Or is it enough to remove header lines in
> a preprocessing stage, as in the sketch below?
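>
> (A minimal sketch of that preprocessing alternative, assuming a single
> input file so the header is the first line of partition 0; the variable
> names are illustrative:)
>
> ```
> scala> val noHeader = sc.textFile("/data").mapPartitionsWithIndex {
>      |   (i, iter) => if (i == 0) iter.drop(1) else iter
>      | }
> ```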
>
> If this is too old and no longer appropriate these days, I'll close the PR
> and the JIRA issue as WON'T FIX.
>
> Thank you all in advance!
>
> Bests,
>
> Dongjoon.
>
> --
> Dongjin Lee
>
> Software developer in Line+.
> So interested in massive-scale machine learning.
> facebook: www.facebook.com/dongjin.lee.kr
> linkedin: kr.linkedin.com/in/dongjinleekr
> github: github.com/dongjinleekr
> twitter: www.twitter.com/dongjinleekr


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Sean Owen
As usual, the sigs / hashes are fine and licenses look fine.

I am still seeing some test failures. A few I've seen over time and aren't
repeatable, but a few seem persistent. Anyone else observed these? I'm on
Ubuntu 16 / Java 8, building with -Pyarn -Phadoop-2.7 -Phive.

If anyone can confirm, I'll investigate the cause if I can. I'd hesitate to
support the release yet unless the build is definitely passing for others.


udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.281 sec <<< ERROR!
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)



- caching on disk *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 3 milliseconds elapsed
  at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
  at org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...


- stress test for failOnDataLoss=false *** FAILED ***
  org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 3b191b78-7f30-46d3-93f8-5fbeecce94a2, runId = 0cab93b6-19d8-47a7-88ad-d296bea72405] terminated with exception: null
  at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:262)
  at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:160)
  ...
  Cause: java.lang.NullPointerException:
  ...



On Thu, Dec 8, 2016 at 4:40 PM Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc2
> (080717497365b83bc202ab16812ced93eb1ea7bd)
>
> List of JIRA tickets resolved are:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1217
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>
>
> (Note that the docs and staging repo are still being uploaded and will be
> available soon)
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Reynold Xin
I uploaded a new one:
https://repository.apache.org/content/repositories/orgapachespark-1219/



On Thu, Dec 8, 2016 at 11:42 PM, Prashant Sharma wrote:

> I am getting a 404 for the link
> https://repository.apache.org/content/repositories/orgapachespark-1217.
>
> --Prashant
>
>
> On Fri, Dec 9, 2016 at 10:43 AM, Michael Allman wrote:
>
>> I believe https://github.com/apache/spark/pull/16122 needs to be
>> included in Spark 2.1. It's a simple bug fix to some functionality that was
>> introduced in 2.1. Unfortunately, it has only been verified manually;
>> there's no unit test that covers it, and building one is far from trivial.
>>
>> Michael
>>
>>
>>
>>
>> On Dec 8, 2016, at 12:39 AM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.0-rc2 (080717497365b83bc202ab16812ced93eb1ea7bd)
>>
>> List of JIRA tickets resolved are:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1217
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>>
>>
>> (Note that the docs and staging repo are still being uploaded and will be
>> available soon)
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.1.0?
>> ===
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>
>>
>>
>


Re: java.lang.IllegalStateException: There is no space for new record

2016-12-09 Thread Liang-Chi Hsieh

Hi Nick,

I think it is due to a bug in UnsafeKVExternalSorter. I created a JIRA issue
and a PR for this bug:

https://issues.apache.org/jira/browse/SPARK-18800
-
Liang-Chi Hsieh | @viirya
Spark Technology Center