[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
-
Description: 
Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should be run concurrently to make full use of 
the available resources.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD (see the sketch below). However, this interferes with the streaming 
statistics, since the batch finishes immediately even though the jobs it 
launched are still running in another thread. It can further affect resuming 
from a checkpoint, since all batches are marked completed right away even 
though the threaded jobs may fail, and the checkpoint only resumes the last 
batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently, and marking the batch complete when all the jobs 
complete.
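A minimal sketch of hack #1 above, assuming two or more input DStreams 
registered on the same context; the output action and paths are placeholders, 
not from the original report. It shows why the hack skews batch tracking: 
foreachRDD returns as soon as the thread starts, so the batch is marked 
complete while the real work is still running.

{code}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical DStreams registered on one StreamingContext.
def runConcurrently(streams: Seq[DStream[(String, String)]]): Unit = {
  streams.foreach { stream =>
    stream.foreachRDD { rdd =>
      // Hack #1: fire the action in a background thread so the next job in the
      // JobSet is not blocked. The enclosing batch finishes immediately, which
      // is exactly what skews the streaming statistics and checkpointing.
      new Thread(new Runnable {
        override def run(): Unit = rdd.saveAsTextFile(s"/tmp/out-${rdd.id}")
      }).start()
    }
  }
}
{code}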

  was:
Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should be run concurrently to make full use of 
the available resources.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this interferes with the streaming statistics, since the 
batch finishes immediately even though the jobs it launched are still running 
in another thread. It can further affect resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently, and marking the batch complete when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (for 
> example, from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs should be run concurrently to make full use 
> of the available resources.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this interferes with the streaming statistics, since the 
> batch finishes immediately even though the jobs it launched are still running 
> in another thread. It can further affect resuming from a checkpoint, since 
> all batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> Therefore, I would propose making the default behavior run all jobs of the 
> current batch concurrently, and marking the batch complete when all the jobs 
> complete.






[jira] [Created] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-11175:


 Summary: Concurrent execution of JobSet within a batch in Spark 
streaming
 Key: SPARK-11175
 URL: https://issues.apache.org/jira/browse/SPARK-11175
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Yongjia Wang


Spark StreamingContext can register multiple independent Input DStreams (such 
as from different Kafka topics) that results in multiple independent jobs for 
each batch. These jobs should better be run concurrently to maximally take 
advantage of available resources. 
I went through a few hacks:
1.  launch the rdd action into a new thread from the function passed to 
foreachRDD to achieve. However, it will mess up with streaming statistics since 
the batch will finish immediately even the jobs it launched are still running 
in another thread. This can further affect resuming from checkpoint, since all 
batches are completed right away even the actual threaded jobs may fail and 
checkpoint only resume the last batch.
2. It's possible by just using foreachRDD and the available APIs to block the 
Jobset to wait for all threads to join, but doing this would mess up with 
closure serialization, and make checkpoint not usable.
Therefore, I would propose to make the default behavior to just run all jobs of 
the current batch concurrently, and mark batch completion when all the jobs 
complete.






[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work: no ORC pushdown predicate

2015-10-18 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296
 ] 

patcharee edited comment on SPARK-11087 at 10/19/15 3:34 AM:
-

[~zzhan]

Below is my test. Please check. I also tried changing 
"hive.exec.orc.split.strategy", but none of the settings produced "OrcInputFormat 
[INFO] ORC pushdown predicate" in the log, as in your result.

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned")
peoplePartitioned.registerTempTable("peoplePartitioned")

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL")
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
res5: Long = 1

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI")
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 

[jira] [Updated] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11126:
---
Priority: Blocker  (was: Critical)

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html
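Not from the Spark source, just a minimal sketch of the leak pattern described 
above: metrics are recorded for every stage, but eviction only happens for 
stages tied to a SQL execution, so stages from non-SQL jobs accumulate without 
bound. All names here are illustrative.

{code}
import scala.collection.mutable

class LeakySqlListenerSketch {
  private val stageIdToMetrics = mutable.HashMap[Int, Long]()
  private val sqlExecutionToStages = mutable.HashMap[Long, Seq[Int]]()

  // Called for every stage of every job, SQL or not.
  def onStageSubmitted(stageId: Int): Unit =
    stageIdToMetrics(stageId) = 0L

  // The only cleanup path: stages that never belonged to a SQL execution are
  // never removed, so stageIdToMetrics grows forever.
  def onSqlExecutionEnd(executionId: Long): Unit =
    sqlExecutionToStages.remove(executionId).foreach(_.foreach(stageIdToMetrics.remove))
}
{code}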






[jira] [Updated] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11126:
---
Affects Version/s: 1.5.0

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html






[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11152:
-
Priority: Major  (was: Minor)

> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint 
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Yongjia Wang
>
> When a streaming job is resumed from a checkpoint at batch time x, suppose the 
> current time when we resume the job is x+10. In this scenario, since Spark 
> will schedule the missing batches from x+1 to x+10 without any metadata, the 
> behavior is to pack all the backlogged input into batch x+1 and then assign 
> any new input to batches x+2 through x+10 immediately, without waiting. This 
> results in tiny batches that capture only the input arriving during the 
> back-to-back scheduling intervals. This behavior is very reasonable. However, 
> the streaming UI does not correctly show the input sizes for these makeup 
> batches - they are all 0 from batch x to x+10. Fixing this would be very 
> helpful. This happens when I use Kafka direct streaming; I assume it would 
> happen for all other streaming sources as well.
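For context, a minimal sketch of how a checkpoint-based resume is typically set 
up; the checkpoint directory, app name, and batch interval below are 
placeholders, not taken from the report.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // hypothetical

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-resume-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input DStreams and output operations here ...
  ssc
}

// On restart this rebuilds the context from the checkpoint and schedules the
// missed ("makeup") batches described above.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}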






[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
-
Description: 
Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should be run concurrently to make full use of 
the available resources. The current behavior is that these jobs end up in an 
invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this interferes with the streaming statistics, since the 
batch finishes immediately even though the jobs it launched are still running 
in another thread. It can further affect resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
3. Instead of running multiple DStreams in one streaming context, run them in 
separate streaming contexts (separate Spark applications). Putting aside the 
extra deployment overhead, when working with a Spark standalone cluster, which 
only has a FIFO scheduler across applications, the resources have to be set in 
advance and will not adjust automatically when the cluster is resized.

Therefore, I think there is a good use case for making the default behavior 
run all jobs of the current batch concurrently, and marking the batch complete 
when all the jobs complete.

  was:
Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should be run concurrently to make full use of 
the available resources. The current behavior is that these jobs end up in an 
invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this interferes with the streaming statistics, since the 
batch finishes immediately even though the jobs it launched are still running 
in another thread. It can further affect resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently, and marking the batch complete when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (for 
> example, from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs should be run concurrently to make full use 
> of the available resources. The current behavior is that these jobs end up in 
> an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this interferes with the streaming statistics, since the 
> batch finishes immediately even though the jobs it launched are still running 
> in another thread. It can further affect resuming from a checkpoint, since 
> all batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them in 
> separate streaming contexts (separate Spark applications). Putting aside the 
> extra deployment overhead, when working with a Spark standalone cluster, 
> which only has a FIFO scheduler across applications, the resources have to be 
> set in advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior 
> run all jobs of the current batch concurrently, and marking the batch 
> complete when all the jobs complete.

[jira] [Commented] (SPARK-7018) Refactor dev/run-tests-jenkins into Python

2015-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962659#comment-14962659
 ] 

Apache Spark commented on SPARK-7018:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9161

> Refactor dev/run-tests-jenkins into Python
> --
>
> Key: SPARK-7018
> URL: https://issues.apache.org/jira/browse/SPARK-7018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Brennon York
>
> This issue is to specifically track the progress of the 
> {{dev/run-tests-jenkins}} script into Python.






[jira] [Updated] (SPARK-5250) EOFException when reading gzipped files from S3 with wholeTextFiles

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5250:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-11176

> EOFException when reading gzipped files from S3 with wholeTextFiles
> --
>
> Key: SPARK-5250
> URL: https://issues.apache.org/jira/browse/SPARK-5250
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 1.2.0
>Reporter: Mojmir Vinkler
>Priority: Critical
>
> I get an `EOFException` when reading *some* gzipped files using 
> `sc.wholeTextFiles`. It happens with just a few files; I thought the files 
> were corrupted, but I was able to read them without problems using 
> `sc.textFile` (and pandas). 
> Traceback for command 
> `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()
> /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
> 674 """
> 675 with SCCallSiteSync(self.context) as css:
> --> 676 bytesInJava = self._jrdd.collect().iterator()
> 677 return list(self._collect_iterator_through_file(bytesInJava))
> 678 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o1576.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: 
> Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
>   at java.io.InputStream.read(InputStream.java:101)
>   at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
>   at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
>   at 
> org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at 

[jira] [Created] (SPARK-11176) Umbrella ticket for wholeTextFiles + S3 bugs

2015-10-18 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11176:
--

 Summary: Umbrella ticket for wholeTextFiles + S3 bugs
 Key: SPARK-11176
 URL: https://issues.apache.org/jira/browse/SPARK-11176
 Project: Spark
  Issue Type: Umbrella
Reporter: Josh Rosen


This umbrella ticket gathers together several distinct bug reports related to 
problems using the wholeTextFiles method to read files from S3.

These issues may have a common underlying cause and should be investigated 
together.






[jira] [Updated] (SPARK-10994) Clustering coefficient computation in GraphX

2015-10-18 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated SPARK-10994:
--
Description: 
We propose to implement an algorithm to compute the clustering coefficient,  in 
GraphX. The clustering coefficient of a vertex (node) in a graph quantifies how 
close its neighbours are to being a clique (complete graph). More specifically, 
the clustering coefficient C_i for a vertex v_i is given by the proportion of 
links between the vertices within its neighbourhood divided by the number of 
links that could possibly exist between them. Duncan J. Watts and Steven 
Strogatz introduced the measure in 1998 to determine whether a graph is a 
small-world network. 
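For reference, the textbook formulation for an undirected graph, where N_i is 
the neighbourhood of v_i and k_i its degree (this formula is standard, not part 
of the proposal text itself):

{code}
C_i = \frac{\left|\{ e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E \}\right|}{\binom{k_i}{2}}
    = \frac{2\,\left|\{ e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E \}\right|}{k_i\,(k_i - 1)}
{code}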



  was:
We propose to implement an algorithm to compute the local clustering 
coefficient in GraphX. The local clustering coefficient of a vertex (node) in a 
graph quantifies how close its neighbors are to being a clique (complete 
graph). More specifically, the local clustering coefficient C_i for a vertex 
v_i is given by the proportion of links between the vertices within its 
neighbourhood divided by the number of links that could possibly exist between 
them. Duncan J. Watts and Steven Strogatz introduced the measure in 1998 to 
determine whether a graph is a small-world network. 




> Clustering coefficient computation in GraphX
> 
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the clustering coefficient in 
> GraphX. The clustering coefficient of a vertex (node) in a graph 
> quantifies how close its neighbours are to being a clique (complete graph). 
> More specifically, the clustering coefficient C_i for a vertex v_i is given 
> by the proportion of links between the vertices within its neighbourhood 
> divided by the number of links that could possibly exist between them. Duncan 
> J. Watts and Steven Strogatz introduced the measure in 1998 to determine 
> whether a graph is a small-world network. 






[jira] [Resolved] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11126.

   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9132
[https://github.com/apache/spark/pull/9132]

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 1.6.0, 1.5.2
>
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html






[jira] [Updated] (SPARK-10994) Clustering coefficient computation in GraphX

2015-10-18 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated SPARK-10994:
--
Summary: Clustering coefficient computation in GraphX  (was: Local 
clustering coefficient computation in GraphX)

> Clustering coefficient computation in GraphX
> 
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the local clustering 
> coefficient in GraphX. The local clustering coefficient of a vertex (node) in 
> a graph quantifies how close its neighbors are to being a clique (complete 
> graph). More specifically, the local clustering coefficient C_i for a vertex 
> v_i is given by the proportion of links between the vertices within its 
> neighbourhood divided by the number of links that could possibly exist 
> between them. Duncan J. Watts and Steven Strogatz introduced the measure in 
> 1998 to determine whether a graph is a small-world network. 






[jira] [Updated] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4414:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-11176

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> 
>
> Key: SPARK-4414
> URL: https://issues.apache.org/jira/browse/SPARK-4414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pedro Rodriguez
>Priority: Critical
>
> SparkContext.wholeTextFiles does not read files that SparkContext.textFile 
> can read. Below are general steps to reproduce; my specific case follows them 
> on a git repo.
> Steps to reproduce:
> 1. Create an Amazon S3 bucket, make it public, and add multiple files.
> 2. Attempt to read the bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even though the file exists:
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /myfile.txt
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error message; the application runs fine.
> There is a question about this on StackOverflow as well:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is a link to the repo/lines of code. The uncommented call doesn't work; 
> the commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multi-file argument, but this should 
> work correctly for S3 bucket files as well.






[jira] [Resolved] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11102.

Resolution: Duplicate

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}
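Until the message is improved, a small defensive check along these lines avoids 
the confusing stack trace; the input path below is a placeholder, not from the 
report.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

val input = new Path("hdfs:///data/events.json") // hypothetical path
val fs = input.getFileSystem(sc.hadoopConfiguration)

val df =
  if (fs.exists(input)) Some(sqlContext.read.json(input.toString))
  else { println(s"Input path does not exist: $input"); None }
{code}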






[jira] [Commented] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source

2015-10-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962627#comment-14962627
 ] 

Josh Rosen commented on SPARK-11102:


This is a duplicate of SPARK-10709

> Uninformative exception when specifying non-existent input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}






[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11175:
-
Description: 
Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should be run concurrently to make full use of 
the available resources. The current behavior is that these jobs end up in an 
invisible job queue and are submitted one by one.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this interferes with the streaming statistics, since the 
batch finishes immediately even though the jobs it launched are still running 
in another thread. It can further affect resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently, and marking the batch complete when all the jobs 
complete.

  was:
Spark StreamingContext can register multiple independent input DStreams (for 
example, from different Kafka topics), which results in multiple independent 
jobs for each batch. These jobs should be run concurrently to make full use of 
the available resources.
I went through a few hacks:
1. Launch the RDD action in a new thread from the function passed to 
foreachRDD. However, this interferes with the streaming statistics, since the 
batch finishes immediately even though the jobs it launched are still running 
in another thread. It can further affect resuming from a checkpoint, since all 
batches are marked completed right away even though the threaded jobs may 
fail, and the checkpoint only resumes the last batch.
2. It is possible, using only foreachRDD and the available APIs, to block the 
JobSet and wait for all threads to join, but doing so interferes with closure 
serialization and makes checkpointing unusable.
Therefore, I would propose making the default behavior run all jobs of the 
current batch concurrently, and marking the batch complete when all the jobs 
complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (for 
> example, from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs should be run concurrently to make full use 
> of the available resources. The current behavior is that these jobs end up in 
> an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this interferes with the streaming statistics, since the 
> batch finishes immediately even though the jobs it launched are still running 
> in another thread. It can further affect resuming from a checkpoint, since 
> all batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> Therefore, I would propose making the default behavior run all jobs of the 
> current batch concurrently, and marking the batch complete when all the jobs 
> complete.






[jira] [Commented] (SPARK-10655) Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT

2015-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962704#comment-14962704
 ] 

Apache Spark commented on SPARK-10655:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/9162

> Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT
> -
>
> Key: SPARK-10655
> URL: https://issues.apache.org/jira/browse/SPARK-10655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Suresh Thalamati
>
> The default type mapping does not work when reading from a DB2 table that 
> contains XML or DECFLOAT columns, or when writing DECIMAL columns. 
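One possible direction, sketched under the assumption that a custom JdbcDialect 
is registered for DB2; the target types chosen here are illustrative, not 
necessarily what the actual patch uses.

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object DB2DialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  // Read-side mapping for types the default dialect cannot handle.
  override def getCatalystType(sqlType: Int, typeName: String, size: Int,
                               md: MetadataBuilder): Option[DataType] = typeName match {
    case "XML"      => Some(StringType)          // assumption: surface XML as a string
    case "DECFLOAT" => Some(DecimalType(38, 18)) // assumption: precision/scale choice
    case _          => None                      // fall back to the default mapping
  }

  // Write-side mapping example.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("CLOB", Types.CLOB))
    case _          => None
  }
}

JdbcDialects.registerDialect(DB2DialectSketch)
{code}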






[jira] [Assigned] (SPARK-10655) Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10655:


Assignee: (was: Apache Spark)

> Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT
> -
>
> Key: SPARK-10655
> URL: https://issues.apache.org/jira/browse/SPARK-10655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Suresh Thalamati
>
> The default type mapping does not work when reading from a DB2 table that 
> contains XML or DECFLOAT columns, or when writing DECIMAL columns. 






[jira] [Assigned] (SPARK-10655) Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10655:


Assignee: Apache Spark

> Enhance DB2 dialect to handle XML, DECIMAL, and DECFLOAT
> -
>
> Key: SPARK-10655
> URL: https://issues.apache.org/jira/browse/SPARK-10655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Suresh Thalamati
>Assignee: Apache Spark
>
> The default type mapping does not work when reading from a DB2 table that 
> contains XML or DECFLOAT columns, or when writing DECIMAL columns. 






[jira] [Updated] (SPARK-11176) Umbrella ticket for wholeTextFiles + S3 bugs

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11176:
---
Component/s: Spark Core
 Input/Output

> Umbrella ticket for wholeTextFiles + S3 bugs
> 
>
> Key: SPARK-11176
> URL: https://issues.apache.org/jira/browse/SPARK-11176
> Project: Spark
>  Issue Type: Umbrella
>  Components: Input/Output, Spark Core
>Reporter: Josh Rosen
>
> This umbrella ticket gathers together several distinct bug reports related to 
> problems using the wholeTextFiles method to read files from S3.
> These issues may have a common underlying cause and should be investigated 
> together.






[jira] [Created] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-18 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11177:
--

 Summary: sc.wholeTextFiles throws ArrayIndexOutOfBoundsException 
when S3 file has zero bytes
 Key: SPARK-11177
 URL: https://issues.apache.org/jira/browse/SPARK-11177
 Project: Spark
  Issue Type: Sub-task
Reporter: Josh Rosen


From a user report:

{quote}
When I upload a series of text files to an S3 directory and one of the files is 
empty (0 bytes), the sc.wholeTextFiles method throws the stack trace below.
java.lang.ArrayIndexOutOfBoundsException: 0
at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506)
at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285)
at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
{quote}

It looks like this has been a longstanding issue:

* 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html
* 
https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark
* 
https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html
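Until this is fixed upstream, one workaround (sketched with a hypothetical 
bucket path) is to skip zero-byte objects before calling wholeTextFiles, which 
accepts a comma-separated list of paths:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("s3n://mybucket/input/") // hypothetical input directory
val fs = dir.getFileSystem(sc.hadoopConfiguration)

// Keep only non-empty files to avoid the ArrayIndexOutOfBoundsException above.
val nonEmpty = fs.listStatus(dir).filter(_.getLen > 0).map(_.getPath.toString)

val texts = sc.wholeTextFiles(nonEmpty.mkString(","))
{code}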






[jira] [Commented] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962774#comment-14962774
 ] 

Yongjia Wang commented on SPARK-11175:
--

Nice, I should have found this. Thank you.

> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (for 
> example, from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs should be run concurrently to make full use 
> of the available resources. The current behavior is that these jobs end up in 
> an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this interferes with the streaming statistics, since the 
> batch finishes immediately even though the jobs it launched are still running 
> in another thread. It can further affect resuming from a checkpoint, since 
> all batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them in 
> separate streaming contexts (separate Spark applications). Putting aside the 
> extra deployment overhead, when working with a Spark standalone cluster, 
> which only has a FIFO scheduler across applications, the resources have to be 
> set in advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior 
> run all jobs of the current batch concurrently, and marking the batch 
> complete when all the jobs complete.






[jira] [Closed] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang closed SPARK-11175.

Resolution: Not A Problem

> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (for 
> example, from different Kafka topics), which results in multiple independent 
> jobs for each batch. These jobs should be run concurrently to make full use 
> of the available resources. The current behavior is that these jobs end up in 
> an invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this interferes with the streaming statistics, since the 
> batch finishes immediately even though the jobs it launched are still running 
> in another thread. It can further affect resuming from a checkpoint, since 
> all batches are marked completed right away even though the threaded jobs may 
> fail, and the checkpoint only resumes the last batch.
> 2. It is possible, using only foreachRDD and the available APIs, to block the 
> JobSet and wait for all threads to join, but doing so interferes with closure 
> serialization and makes checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them in 
> separate streaming contexts (separate Spark applications). Putting aside the 
> extra deployment overhead, when working with a Spark standalone cluster, 
> which only has a FIFO scheduler across applications, the resources have to be 
> set in advance and will not adjust automatically when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior 
> run all jobs of the current batch concurrently, and marking the batch 
> complete when all the jobs complete.






[jira] [Commented] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

2015-10-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962708#comment-14962708
 ] 

Saisai Shao commented on SPARK-11175:
-

There's a configuration, "spark.streaming.concurrentJobs", that may satisfy your 
needs. The default value is 1, which means jobs are submitted one by one.
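
A minimal sketch of applying this setting from PySpark (the app name and the 
value 4 are arbitrary placeholders; the property itself is the one named above):

{code}
# Sketch only: enable concurrent submission of the jobs within a batch.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("concurrent-jobs-sketch")
        .set("spark.streaming.concurrentJobs", "4"))  # default is 1 (one job at a time)

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)
# ... register the independent DStreams as described above, then:
ssc.start()
ssc.awaitTermination()
{code}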

> Concurrent execution of JobSet within a batch in Spark streaming
> 
>
> Key: SPARK-11175
> URL: https://issues.apache.org/jira/browse/SPARK-11175
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent input DStreams (such 
> as from different Kafka topics), which results in multiple independent jobs for 
> each batch. These jobs should be run concurrently to take full advantage of the 
> available resources. The current behavior is that these jobs end up in an 
> invisible job queue and are submitted one by one.
> I went through a few hacks:
> 1. Launch the RDD action in a new thread from the function passed to 
> foreachRDD. However, this interferes with streaming statistics, since the batch 
> finishes immediately even though the jobs it launched are still running in 
> another thread. This can further affect resuming from a checkpoint, since all 
> batches are marked completed right away even if the threaded jobs fail, and the 
> checkpoint only resumes the last batch.
> 2. It's possible, using only foreachRDD and the available APIs, to block the 
> JobSet until all threads join, but doing this would interfere with closure 
> serialization and make checkpointing unusable.
> 3. Instead of running multiple DStreams in one streaming context, run them in 
> separate streaming contexts (separate Spark applications). Putting aside the 
> extra deployment overhead, a Spark standalone cluster only has a FIFO scheduler 
> across applications, so resources have to be set in advance and won't 
> automatically adjust when the cluster is resized.
> Therefore, I think there is a good use case for making the default behavior run 
> all jobs of the current batch concurrently and marking the batch complete when 
> all the jobs finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962806#comment-14962806
 ] 

Apache Spark commented on SPARK-11126:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9163

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html
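
A schematic Python sketch of the bookkeeping pattern described above, not the 
actual Scala SQLListener code: entries are added for every stage, but cleanup 
only covers stages tied to a SQL execution, so the rest accumulate.

{code}
# Schematic only (Python, not the actual SQLListener): metrics are recorded for
# every stage, but cleanup only touches stages that belonged to a SQL execution,
# so entries for all other stages are never removed.
class LeakyListener(object):
    def __init__(self):
        self.stage_id_to_stage_metrics = {}   # grows for every stage
        self.execution_to_stage_ids = {}      # only SQL executions are tracked here

    def on_stage_submitted(self, stage_id, execution_id=None):
        self.stage_id_to_stage_metrics[stage_id] = {}
        if execution_id is not None:
            self.execution_to_stage_ids.setdefault(execution_id, set()).add(stage_id)

    def on_execution_end(self, execution_id):
        # Only stages tied to this SQL execution are cleaned up; stages submitted
        # with execution_id=None stay in stage_id_to_stage_metrics forever.
        for stage_id in self.execution_to_stage_ids.pop(execution_id, set()):
            self.stage_id_to_stage_metrics.pop(stage_id, None)
{code}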



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles

2015-10-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962857#comment-14962857
 ] 

Josh Rosen commented on SPARK-5250:
---

I've spent some time trying to reproduce this under various Hadoop versions but 
haven't had any luck so far.

Based on some reports at 
https://forums.aws.amazon.com/thread.jspa?threadID=62350 and other sources, it 
sounds like this stacktrace could be caused by reading a corrupt input file. That 
said, the fact that the file _was_ readable through {{sc.textFile}} may rule this 
out.
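
A hypothetical check along these lines, assuming a PySpark shell where sc is 
defined and a placeholder bucket/key: if textFile reads the object but 
wholeTextFiles raises EOFException, the problem is in the wholeTextFiles path 
rather than file corruption.

{code}
# Hypothetical check; the bucket/key below are placeholders.
path = "s3n://some-bucket/some-file.csv.gz"

print(sc.textFile(path).count())               # line-oriented read; reported to work
print(len(sc.wholeTextFiles(path).collect()))  # (filename, content) pairs; reported to fail
{code}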

> EOFException in when reading gzipped files from S3 with wholeTextFiles
> --
>
> Key: SPARK-5250
> URL: https://issues.apache.org/jira/browse/SPARK-5250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Mojmir Vinkler
>Priority: Critical
>
> I get an `EOFException` error when reading *some* gzipped files using 
> `sc.wholeTextFiles`. It happens to just a few files. I thought the file was 
> corrupted, but I was able to read it without problems using `sc.textFile` 
> (and pandas).
> Traceback for command 
> `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()
> /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
> 674 """
> 675 with SCCallSiteSync(self.context) as css:
> --> 676 bytesInJava = self._jrdd.collect().iterator()
> 677 return list(self._collect_iterator_through_file(bytesInJava))
> 678 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o1576.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: 
> Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
>   at java.io.InputStream.read(InputStream.java:101)
>   at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
>   at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
>   at 
> org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at 

[jira] [Resolved] (SPARK-7018) Refactor dev/run-tests-jenkins into Python

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-7018.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9161
[https://github.com/apache/spark/pull/9161]

> Refactor dev/run-tests-jenkins into Python
> --
>
> Key: SPARK-7018
> URL: https://issues.apache.org/jira/browse/SPARK-7018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Brennon York
> Fix For: 1.6.0
>
>
> This issue is to specifically track the progress of porting the 
> {{dev/run-tests-jenkins}} script to Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7018) Refactor dev/run-tests-jenkins into Python

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7018:
--
Assignee: Brennon York

> Refactor dev/run-tests-jenkins into Python
> --
>
> Key: SPARK-7018
> URL: https://issues.apache.org/jira/browse/SPARK-7018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Brennon York
>Assignee: Brennon York
> Fix For: 1.6.0
>
>
> This issue is to specifically track the progress of porting the 
> {{dev/run-tests-jenkins}} script to Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

2015-10-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962832#comment-14962832
 ] 

Josh Rosen commented on SPARK-4414:
---

I was able to reproduce this under a Hadoop 1.0.4 build of Spark and may be 
able to fix it as part of a larger patch to fix some other wholeTextFiles S3 
issues (see SPARK-11176).

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> 
>
> Key: SPARK-4414
> URL: https://issues.apache.org/jira/browse/SPARK-4414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pedro Rodriguez
>Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
> can read. Below are general steps to reproduce; my specific case, based on a 
> git repo, follows them.
> Steps to reproduce:
> 1. Create an Amazon S3 bucket, make it public, and upload multiple files
> 2. Attempt to read the bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even though the file exists:
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /myfile.txt
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error; the application runs fine.
> There is also a question about this on StackOverflow:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is a link to the repo/lines of code. The uncommented call doesn't work, the 
> commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multi-file argument (see the sketch 
> after this description), but wholeTextFiles should work correctly for S3 bucket 
> files as well.
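
A sketch of the multi-file textFile workaround hinted at above (bucket and file 
names are placeholders; sc is assumed to come from a PySpark shell):

{code}
# Placeholders throughout; textFile accepts a comma-separated list or a glob,
# at the cost of losing the (filename, content) pairing that wholeTextFiles gives.
paths = "s3n://mybucket/a.txt,s3n://mybucket/b.txt"   # or "s3n://mybucket/*.txt"
print(sc.textFile(paths).count())                     # works per the report

# The call from the report; the FileNotFoundException surfaces once an action runs.
print(sc.wholeTextFiles("s3n://mybucket/myfile.txt").collect())
{code}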



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-4414:
-

Assignee: Josh Rosen

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> 
>
> Key: SPARK-4414
> URL: https://issues.apache.org/jira/browse/SPARK-4414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pedro Rodriguez
>Assignee: Josh Rosen
>Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
> can read. Below are general steps to reproduce; my specific case, based on a 
> git repo, follows them.
> Steps to reproduce:
> 1. Create an Amazon S3 bucket, make it public, and upload multiple files
> 2. Attempt to read the bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even though the file exists:
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /myfile.txt
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error; the application runs fine.
> There is also a question about this on StackOverflow:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is a link to the repo/lines of code. The uncommented call doesn't work, the 
> commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multi-file argument, but wholeTextFiles 
> should work correctly for S3 bucket files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5250:
--
Component/s: (was: PySpark)

> EOFException in when reading gzipped files from S3 with wholeTextFiles
> --
>
> Key: SPARK-5250
> URL: https://issues.apache.org/jira/browse/SPARK-5250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Mojmir Vinkler
>Priority: Critical
>
> I get an `EOFException` error when reading *some* gzipped files using 
> `sc.wholeTextFiles`. It happens to just a few files. I thought the file was 
> corrupted, but I was able to read it without problems using `sc.textFile` 
> (and pandas).
> Traceback for command 
> `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()
> /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
> 674 """
> 675 with SCCallSiteSync(self.context) as css:
> --> 676 bytesInJava = self._jrdd.collect().iterator()
> 677 return list(self._collect_iterator_through_file(bytesInJava))
> 678 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o1576.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: 
> Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
>   at java.io.InputStream.read(InputStream.java:101)
>   at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
>   at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
>   at 
> org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at 

[jira] [Commented] (SPARK-11178) Improve naming around task failures in scheduler code

2015-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962872#comment-14962872
 ] 

Apache Spark commented on SPARK-11178:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/9164

> Improve naming around task failures in scheduler code
> -
>
> Key: SPARK-11178
> URL: https://issues.apache.org/jira/browse/SPARK-11178
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.5.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Trivial
>
> Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new functionality 
> so that if an executor dies for a reason that's not caused by one of the 
> tasks running on the executor (e.g., due to pre-emption), Spark doesn't count 
> the failure towards the maximum number of failures for the task.  That commit 
> introduced some vague naming that I think we should fix; in particular:
> 
> (1) The variable "isNormalExit", which was used to refer to cases where the 
> executor died for a reason unrelated to the tasks running on the machine.  
> The problem with the existing name is that it's not clear (at least to me!) 
> what it means for an exit to be "normal".
> 
> (2) The variable "shouldEventuallyFailJob" is used to determine whether a 
> task's failure should be counted towards the maximum number of failures 
> allowed for a task before the associated Stage is aborted. The problem with 
> the existing name is that it can be confused with implying that the task's 
> failure should immediately cause the stage to fail because it is somehow 
> fatal (this is the case for a fetch failure, for example: if a task fails 
> because of a fetch failure, there's no point in retrying, and the whole stage 
> should be failed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11178) Improve naming around task failures in scheduler code

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11178:


Assignee: Kay Ousterhout  (was: Apache Spark)

> Improve naming around task failures in scheduler code
> -
>
> Key: SPARK-11178
> URL: https://issues.apache.org/jira/browse/SPARK-11178
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.5.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Trivial
>
> Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new functionality 
> so that if an executor dies for a reason that's not caused by one of the 
> tasks running on the executor (e.g., due to pre-emption), Spark doesn't count 
> the failure towards the maximum number of failures for the task.  That commit 
> introduced some vague naming that I think we should fix; in particular:
> 
> (1) The variable "isNormalExit", which was used to refer to cases where the 
> executor died for a reason unrelated to the tasks running on the machine.  
> The problem with the existing name is that it's not clear (at least to me!) 
> what it means for an exit to be "normal".
> 
> (2) The variable "shouldEventuallyFailJob" is used to determine whether a 
> task's failure should be counted towards the maximum number of failures 
> allowed for a task before the associated Stage is aborted. The problem with 
> the existing name is that it can be confused with implying that the task's 
> failure should immediately cause the stage to fail because it is somehow 
> fatal (this is the case for a fetch failure, for example: if a task fails 
> because of a fetch failure, there's no point in retrying, and the whole stage 
> should be failed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11178) Improve naming around task failures in scheduler code

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11178:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Improve naming around task failures in scheduler code
> -
>
> Key: SPARK-11178
> URL: https://issues.apache.org/jira/browse/SPARK-11178
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.5.1
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Trivial
>
> Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new functionality 
> so that if an executor dies for a reason that's not caused by one of the 
> tasks running on the executor (e.g., due to pre-emption), Spark doesn't count 
> the failure towards the maximum number of failures for the task.  That commit 
> introduced some vague naming that I think we should fix; in particular:
> 
> (1) The variable "isNormalExit", which was used to refer to cases where the 
> executor died for a reason unrelated to the tasks running on the machine.  
> The problem with the existing name is that it's not clear (at least to me!) 
> what it means for an exit to be "normal".
> 
> (2) The variable "shouldEventuallyFailJob" is used to determine whether a 
> task's failure should be counted towards the maximum number of failures 
> allowed for a task before the associated Stage is aborted. The problem with 
> the existing name is that it can be confused with implying that the task's 
> failure should immediately cause the stage to fail because it is somehow 
> fatal (this is the case for a fetch failure, for example: if a task fails 
> because of a fetch failure, there's no point in retrying, and the whole stage 
> should be failed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11178) Improve naming around task failures in scheduler code

2015-10-18 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-11178:
--

 Summary: Improve naming around task failures in scheduler code
 Key: SPARK-11178
 URL: https://issues.apache.org/jira/browse/SPARK-11178
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.5.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Trivial


Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new functionality so 
that if an executor dies for a reason that's not caused by one of the tasks 
running on the executor (e.g., due to pre-emption), Spark doesn't count the 
failure towards the maximum number of failures for the task.  That commit 
introduced some vague naming that I think we should fix; in particular:

(1) The variable "isNormalExit", which was used to refer to cases where the 
executor died for a reason unrelated to the tasks running on the machine.  The 
problem with the existing name is that it's not clear (at least to me!) what it 
means for an exit to be "normal".

(2) The variable "shouldEventuallyFailJob" is used to determine whether a 
task's failure should be counted towards the maximum number of failures allowed 
for a task before the associated Stage is aborted. The problem with the 
existing name is that it can be confused with implying that the task's failure 
should immediately cause the stage to fail because it is somehow fatal (this is 
the case for a fetch failure, for example: if a task fails because of a fetch 
failure, there's no point in retrying, and the whole stage should be failed).
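
An illustrative Python sketch of the counting behavior described above; this is 
not the Spark scheduler code, and the names are placeholders rather than the 
ones the eventual patch may use.

{code}
# Illustrative only; plain Python, not the Spark scheduler.
MAX_TASK_FAILURES = 4

class FailureReason(object):
    def __init__(self, counts_toward_task_failures, fails_stage_immediately):
        self.counts_toward_task_failures = counts_toward_task_failures
        self.fails_stage_immediately = fails_stage_immediately

def handle_task_failure(failure_count, reason):
    """Return (new_failure_count, action) for one failed task attempt."""
    if not reason.counts_toward_task_failures:
        return failure_count, "resubmit"          # e.g. executor pre-empted
    failure_count += 1
    if reason.fails_stage_immediately or failure_count >= MAX_TASK_FAILURES:
        return failure_count, "abort_stage"       # e.g. fetch failure, or too many failures
    return failure_count, "resubmit"

print(handle_task_failure(0, FailureReason(False, False)))  # -> (0, 'resubmit')
print(handle_task_failure(3, FailureReason(True, False)))   # -> (4, 'abort_stage')
{code}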



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles

2015-10-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962886#comment-14962886
 ] 

Josh Rosen commented on SPARK-5250:
---

I also wonder if this is an S3-specific issue.

> EOFException in when reading gzipped files from S3 with wholeTextFiles
> --
>
> Key: SPARK-5250
> URL: https://issues.apache.org/jira/browse/SPARK-5250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Mojmir Vinkler
>Priority: Critical
>
> I get an `EOFException` error when reading *some* gzipped files using 
> `sc.wholeTextFiles`. It happens to just a few files. I thought the file was 
> corrupted, but I was able to read it without problems using `sc.textFile` 
> (and pandas).
> Traceback for command 
> `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()
> /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
> 674 """
> 675 with SCCallSiteSync(self.context) as css:
> --> 676 bytesInJava = self._jrdd.collect().iterator()
> 677 return list(self._collect_iterator_through_file(bytesInJava))
> 678 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o1576.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: 
> Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
>   at java.io.InputStream.read(InputStream.java:101)
>   at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
>   at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
>   at 
> org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at 

[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962821#comment-14962821
 ] 

Josh Rosen commented on SPARK-11177:


Reproduced it. This only occurs on Hadoop 1.x.

> sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero 
> bytes
> ---
>
> Key: SPARK-11177
> URL: https://issues.apache.org/jira/browse/SPARK-11177
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Josh Rosen
>
> From a user report:
> {quote}
> When I upload a series of text files to an S3 directory and one of the files 
> is empty (0 bytes), the sc.wholeTextFiles method throws the stack trace below.
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245)
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
> {quote}
> It looks like this has been a longstanding issue:
> * 
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html
> * 
> https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark
> * 
> https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html
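
A hypothetical local repro sketch (paths are placeholders and sc is assumed to 
come from a PySpark shell; per the comment above, the failure was only 
reproduced on Hadoop 1.x against S3, so this may well succeed elsewhere):

{code}
# Hypothetical local repro; paths are placeholders.
import os

d = "/tmp/wtf-repro"
if not os.path.isdir(d):
    os.makedirs(d)
with open(os.path.join(d, "a.txt"), "w") as f:
    f.write("hello\n")
open(os.path.join(d, "empty.txt"), "w").close()   # the 0-byte file

print(sc.wholeTextFiles(d).collect())
{code}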



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-11177:
--

Assignee: Josh Rosen

> sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero 
> bytes
> ---
>
> Key: SPARK-11177
> URL: https://issues.apache.org/jira/browse/SPARK-11177
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> From a user report:
> {quote}
> When I upload a series of text files to an S3 directory and one of the files 
> is empty (0 bytes), the sc.wholeTextFiles method throws the stack trace below.
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245)
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
> {quote}
> It looks like this has been a longstanding issue:
> * 
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html
> * 
> https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark
> * 
> https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10994) Clustering coefficient computation in GraphX

2015-10-18 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated SPARK-10994:
--
Description: 
The Clustering Coefficient (CC) is a fundamental measure in social (or other 
type of) network analysis assessing the degree to which nodes tend to cluster 
together. We propose to implement an algorithm to compute the clustering 
coefficient for each vertex of a given graph in GraphX.

Specifically, the clustering coefficient of a vertex (node) in a graph 
quantifies how close its neighbours are to being a clique (complete graph). 
More formally, the clustering coefficient C_i for a vertex v_i is given by the 
proportion of links between the vertices within its neighbourhood divided by 
the number of links that could possibly exist between them. 

The clustering coefficient is well known and has wide applications. Duncan J. 
Watts and Steven Strogatz introduced the measure in 1998 to determine whether a 
graph is a small-world network (1). Their paper has attracted 27,266 citations 
to date. 
Similar features are included in NetworkX (2), SNAP (3), etc. 

(1) Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of 
‘small-world’networks." nature 393.6684 (1998): 440-442.
(2) http://networkx.github.io/
(3) http://snap.stanford.edu/


  was:
We propose to implement an algorithm to compute the clustering coefficient,  in 
GraphX. The clustering coefficient of a vertex (node) in a graph quantifies how 
close its neighbours are to being a clique (complete graph). More specifically, 
the clustering coefficient C_i for a vertex v_i is given by the proportion of 
links between the vertices within its neighbourhood divided by the number of 
links that could possibly exist between them. Duncan J. Watts and Steven 
Strogatz introduced the measure in 1998 to determine whether a graph is a 
small-world network. 




> Clustering coefficient computation in GraphX
> 
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The Clustering Coefficient (CC) is a fundamental measure in social (or other 
> type of) network analysis assessing the degree to which nodes tend to cluster 
> together. We propose to implement an algorithm to compute the clustering 
> coefficient for each vertex of a given graph in GraphX.
> Specifically, the clustering coefficient of a vertex (node) in a graph 
> quantifies how close its neighbours are to being a clique (complete graph). 
> More formally, the clustering coefficient C_i for a vertex v_i is given by 
> the proportion of links between the vertices within its neighbourhood divided 
> by the number of links that could possibly exist between them. 
> The clustering coefficient is well known and has wide applications. Duncan J. 
> Watts and Steven Strogatz introduced the measure in 1998 to determine whether 
> a graph is a small-world network (1). Their paper has attracted 27,266 
> citations to date. Similar features are included in NetworkX (2), SNAP (3), 
> etc. 
> (1) Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of 
> ‘small-world’networks." nature 393.6684 (1998): 440-442.
> (2) http://networkx.github.io/
> (3) http://snap.stanford.edu/
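
A plain-Python illustration of the definition above, not the proposed GraphX 
implementation (the adjacency dict is a made-up example):

{code}
# Illustration of the definition only, using a plain adjacency dict.
def local_clustering_coefficient(adj, v):
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count edges that actually exist between pairs of neighbours of v.
    links = sum(1 for a in neighbours for b in adj[a] if b in neighbours and a < b)
    return 2.0 * links / (k * (k - 1))

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(local_clustering_coefficient(adj, 3))   # one edge among {1, 2, 4} -> 1/3
{code}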



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11158:
---
Assignee: Mahmoud Lababidi

> Add more information in Error statement for sql/types _verify_type()
> ---
>
> Key: SPARK-11158
> URL: https://issues.apache.org/jira/browse/SPARK-11158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mahmoud Lababidi
>Assignee: Mahmoud Lababidi
>Priority: Minor
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11126:
---
Target Version/s: 1.5.2, 1.6.0

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Shixiong Zhu
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11126:
---
Priority: Critical  (was: Major)

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11126:
---
Affects Version/s: 1.5.1

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Shixiong Zhu
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11126) A memory leak in SQLListener._stageIdToStageMetrics

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11126:
---
Assignee: Shixiong Zhu

> A memory leak in SQLListener._stageIdToStageMetrics
> ---
>
> Key: SPARK-11126
> URL: https://issues.apache.org/jira/browse/SPARK-11126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SQLListener adds all stage infos to _stageIdToStageMetrics, but only removes  
> stage infos belonging to SQL executions.
> Reported by Terry Hoo in 
> https://www.mail-archive.com/user@spark.apache.org/msg38810.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11172) Close JsonParser/Generator in test

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11172.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9157
[https://github.com/apache/spark/pull/9157]

> Close JsonParser/Generator in test
> --
>
> Key: SPARK-11172
> URL: https://issues.apache.org/jira/browse/SPARK-11172
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> JsonParser / Generator created in test should be closed.
> This is in continuation to SPARK-11124



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11172) Close JsonParser/Generator in test

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11172:
---
Assignee: Ted Yu

> Close JsonParser/Generator in test
> --
>
> Key: SPARK-11172
> URL: https://issues.apache.org/jira/browse/SPARK-11172
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> JsonParser / Generator created in test should be closed.
> This is in continuation to SPARK-11124



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11173) Cannot save data via MSSQL JDBC

2015-10-18 Thread Gianluca Salvo (JIRA)
Gianluca Salvo created SPARK-11173:
--

 Summary: Cannot save data via MSSQL JDBC
 Key: SPARK-11173
 URL: https://issues.apache.org/jira/browse/SPARK-11173
 Project: Spark
  Issue Type: Bug
  Components: Java API, PySpark, SQL
Affects Versions: 1.5.1
 Environment: Windows 7 sp1 x64, java version "1.8.0_60", Spark 1.5.1, 
hadoop 2.6, microsoft jdbc 4.2, pyspark
Reporter: Gianluca Salvo


Hello,
I'm experiencing an issue writing a DataFrame via JDBC. My code is
{code:title=Example.python|borderStyle=solid}
from pyspark import SparkContext
from pyspark.sql import SQLContext
import sys

sc=SparkContext(appName="SQL Query")

sqlctx=SQLContext(sc)
serverName="SQLIPAddress"
serverPort="SQL Port"
serverUsername="username"
serverPassword="password"
serverDatabase="database"
#
connString="jdbc:sqlserver://{SERVER}:{PORT};user={USER};password={PASSWORD};databasename={DATABASENAME}"
connString=connString.format(SERVER=serverName,PORT=serverPort,USER=serverUsername,PASSWORD=serverPassword,DATABASENAME=serverDatabase)

df=sqlctx.read.format("jdbc").options(url=connString,dbtable="(select * from 
TestTable) as test_Table").load()

df.show()

try:
df.write.jdbc(connString,"Test_Target","append")
print("saving completed")
except:
print("Error in saving data",sys.exc_info()[0])

sc.stop()
{code}
Even if I specify *append*, the code throws an exception saying it is trying to 
create the table *Test_Target*, but the table is already present.
If I point the script at MariaDB, everything is fine:
{code:title=New Connection string|borderStyle=solid}
connString="jdbc:mysql://{SERVER}:{PORT}/{DATABASENAME}?user={USER}&password={PASSWORD}";
{code}
The problem seems to be the Microsoft JDBC driver. Can you suggest or implement 
some workaround?

Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11173) Cannot save data via MSSQL JDBC

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11173.
---
Resolution: Not A Problem

This does not appear to be an issue in Spark.

> Cannot save data via MSSQL JDBC
> ---
>
> Key: SPARK-11173
> URL: https://issues.apache.org/jira/browse/SPARK-11173
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Windows 7 sp1 x64, java version "1.8.0_60", Spark 1.5.1, 
> hadoop 2.6, microsoft jdbc 4.2, pyspark
>Reporter: Gianluca Salvo
>  Labels: patch
>
> Hello,
> I'm experiencing an issue writing a DataFrame via JDBC. My code is
> {code:title=Example.python|borderStyle=solid}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> import sys
> sc=SparkContext(appName="SQL Query")
> sqlctx=SQLContext(sc)
> serverName="SQLIPAddress"
> serverPort="SQL Port"
> serverUsername="username"
> serverPassword="password"
> serverDatabase="database"
> #
> connString="jdbc:sqlserver://{SERVER}:{PORT};user={USER};password={PASSWORD};databasename={DATABASENAME}"
> connString=connString.format(SERVER=serverName,PORT=serverPort,USER=serverUsername,PASSWORD=serverPassword,DATABASENAME=serverDatabase)
> df=sqlctx.read.format("jdbc").options(url=connString,dbtable="(select * from 
> TestTable) as test_Table").load()
> df.show()
> try:
> df.write.jdbc(connString,"Test_Target","append")
> print("saving completed")
> except:
> print("Error in saving data",sys.exc_info()[0])
> sc.stop()
> {code}
> Even if I specify *append*, the code throws an exception saying it is trying 
> to create the table *Test_Target*, but the table is already present.
> If I point the script at MariaDB, everything is fine:
> {code:title=New Connection string|borderStyle=solid}
> connString="jdbc:mysql://{SERVER}:{PORT}/{DATABASENAME}?user={USER}&password={PASSWORD}";
> {code}
> The problem seems to be the Microsoft JDBC driver. Can you suggest or 
> implement some workaround?
> Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11169) Remove the extra spaces in merge script

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11169:
--
Priority: Trivial  (was: Major)

> Remove the extra spaces in merge script
> ---
>
> Key: SPARK-11169
> URL: https://issues.apache.org/jira/browse/SPARK-11169
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Trivial
>
> Our merge script now turns 
> {code}
> [SPARK-1234][SPARK-1235][SPARK-1236][SQL] description
> {code}
> into 
> {code}
> [SPARK-1234] [SPARK-1235] [SPARK-1236] [SQL] description
> {code}
> The extra spaces are more annoying in git since the first line of a git 
> commit is supposed to be very short.
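
An illustrative sketch of the desired normalization, not the actual 
dev/merge_spark_pr.py code:

{code}
# Illustrative only, not the actual merge script.
import re

def normalize_title(title):
    # "[SPARK-1234] [SPARK-1235] [SQL] description" -> "[SPARK-1234][SPARK-1235][SQL] description"
    return re.sub(r"\]\s+\[", "][", title)

print(normalize_title("[SPARK-1234] [SPARK-1235] [SPARK-1236] [SQL] description"))
{code}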



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10757) Java friendly constructor for distributed matrices

2015-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962335#comment-14962335
 ] 

Apache Spark commented on SPARK-10757:
--

User 'javelinjs' has created a pull request for this issue:
https://github.com/apache/spark/pull/9159

> Java friendly constructor for distributed matrices
> --
>
> Key: SPARK-10757
> URL: https://issues.apache.org/jira/browse/SPARK-10757
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently users cannot construct 
> BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix from the Java side 
> because these classes do not provide Java-friendly constructors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11174:


Assignee: Apache Spark

> Typo in the GraphX programming guide
> 
>
> Key: SPARK-11174
> URL: https://issues.apache.org/jira/browse/SPARK-11174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 1.5.1
>Reporter: Łukasz Piepiora
>Assignee: Apache Spark
>Priority: Trivial
>
> There is a small typo in the GraphX documentation. In the EdgeRDD description 
> it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962356#comment-14962356
 ] 

Apache Spark commented on SPARK-11174:
--

User 'lpiepiora' has created a pull request for this issue:
https://github.com/apache/spark/pull/9160

> Typo in the GraphX programming guide
> 
>
> Key: SPARK-11174
> URL: https://issues.apache.org/jira/browse/SPARK-11174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 1.5.1
>Reporter: Łukasz Piepiora
>Priority: Trivial
>
> There is a small typo in the GraphX documentation. In the EdgeRDD description 
> it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11174:
--
Assignee: Łukasz Piepiora

> Typo in the GraphX programming guide
> 
>
> Key: SPARK-11174
> URL: https://issues.apache.org/jira/browse/SPARK-11174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 1.5.1
>Reporter: Łukasz Piepiora
>Assignee: Łukasz Piepiora
>Priority: Trivial
> Fix For: 1.6.0
>
>
> There is a small typo in the GraphX documentation. In the EdgeRDD description 
> it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11174.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9160
[https://github.com/apache/spark/pull/9160]

> Typo in the GraphX programming guide
> 
>
> Key: SPARK-11174
> URL: https://issues.apache.org/jira/browse/SPARK-11174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 1.5.1
>Reporter: Łukasz Piepiora
>Priority: Trivial
> Fix For: 1.6.0
>
>
> There is a small typo in the GraphX documentation. In the EdgeRDD description 
> it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10757) Java friendly constructor for distributed matrices

2015-10-18 Thread Yizhi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962338#comment-14962338
 ] 

Yizhi Liu commented on SPARK-10757:
---

[~yanboliang] May I ask you to review my PR once you've got a minute?

> Java friendly constructor for distributed matrices
> --
>
> Key: SPARK-10757
> URL: https://issues.apache.org/jira/browse/SPARK-10757
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently users cannot construct 
> BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix from the Java side 
> because these classes do not provide Java-friendly constructors. 
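
For context, a minimal sketch of why this is awkward: the Scala constructors take
RDD types rather than JavaRDD, so Java callers have to drop down to .rdd() first.
The snippet below assumes a spark-shell SparkContext named sc and illustrative
data; it is not the proposed fix, just the current Scala-side usage.

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: RowMatrix takes an RDD[Vector]. From Java, a JavaRDD<Vector> must
// first be converted with .rdd(), which is the friction this issue describes.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)
))
val mat = new RowMatrix(rows)
println(s"${mat.numRows()} x ${mat.numCols()}")
{code}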



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11147) HTTP 500 if try to access Spark UI in yarn-cluster

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11147.
---
Resolution: Not A Problem

> HTTP 500 if try to access Spark UI in yarn-cluster
> --
>
> Key: SPARK-11147
> URL: https://issues.apache.org/jira/browse/SPARK-11147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 1.5.1
> Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950)
> Spark: 1.5.x (c27e1904)
>Reporter: Sebastian YEPES FERNANDEZ
> Attachments: SparkUI.png, SparkUI.png
>
>
> Hello,
> I am facing a similar issue to the one described in SPARK-5837, but in my case 
> the SparkUI only works in "yarn-client" mode. If I run the same job using 
> "yarn-cluster" I get an HTTP 500 error:
> {code}
> HTTP ERROR 500
> Problem accessing /proxy/application_1444297190346_0085/. Reason:
> Connection to http://XX.XX.XX.XX:55827 refused
> Caused by:
> org.apache.http.conn.HttpHostConnectException: Connection to 
> http://XX.XX.XX.XX:55827 refused
>   at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
>   at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> {code}
> I have verified that the UI port "55827" is actually listening on the worker 
> node; I can even run "curl http://XX.XX.XX.XX:55827" and it redirects me to 
> another URL: http://YY.YY.YY.YY:8088/proxy/application_1444297190346_0082
> The strange thing is that it redirects me to the app "_0082" and not the 
> actually running job "_0085".
> Does anyone have any suggestions on what could be causing this issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread JIRA
Łukasz Piepiora created SPARK-11174:
---

 Summary: Typo in the GraphX programming guide
 Key: SPARK-11174
 URL: https://issues.apache.org/jira/browse/SPARK-11174
 Project: Spark
  Issue Type: Bug
  Components: Documentation, GraphX
Affects Versions: 1.5.1
Reporter: Łukasz Piepiora
Priority: Trivial


There is a small typo in the GraphX documentation. In the EdgeRDD description 
it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962348#comment-14962348
 ] 

Sean Owen commented on SPARK-11174:
---

This isn't even worth a JIRA, but please open a pull request.

> Typo in the GraphX programming guide
> 
>
> Key: SPARK-11174
> URL: https://issues.apache.org/jira/browse/SPARK-11174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 1.5.1
>Reporter: Łukasz Piepiora
>Priority: Trivial
>
> There is a small typo in the GraphX documentation. In the EdgeRDD description 
> it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11174) Typo in the GraphX programming guide

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11174:


Assignee: (was: Apache Spark)

> Typo in the GraphX programming guide
> 
>
> Key: SPARK-11174
> URL: https://issues.apache.org/jira/browse/SPARK-11174
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, GraphX
>Affects Versions: 1.5.1
>Reporter: Łukasz Piepiora
>Priority: Trivial
>
> There is a small typo in the GraphX documentation. In the EdgeRDD description 
> it says "Revere" but should say "Reverse".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10757) Java friendly constructor for distributed matrices

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10757:


Assignee: Apache Spark

> Java friendly constructor for distributed matrices
> --
>
> Key: SPARK-10757
> URL: https://issues.apache.org/jira/browse/SPARK-10757
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently users cannot construct 
> BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix from the Java side 
> because these classes do not provide Java-friendly constructors. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10757) Java friendly constructor for distributed matrices

2015-10-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10757:


Assignee: (was: Apache Spark)

> Java friendly constructor for distributed matrices
> --
>
> Key: SPARK-10757
> URL: https://issues.apache.org/jira/browse/SPARK-10757
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently users cannot construct 
> BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix from the Java side 
> because these classes do not provide Java-friendly constructors. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11161) Viewing the web UI for the first time unpersists a cached RDD

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11161.
---
Resolution: Not A Problem

Good analysis. Yes, GC'ed RDDs are unpersisted, and I'm fairly sure that explains 
this. That is, you don't "mind" it being unpersisted, right? It's not causing a 
problem.
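
A hedged illustration of that explanation: the ContextCleaner unpersists RDDs
whose driver-side references have been garbage collected, so keeping a strong
reference (instead of chaining .cache.count in a single expression) should avoid
the cleanup-driven unpersist. This is a sketch of the idea with illustrative
sizes, not a claimed fix for the web UI interaction.

{code}
// Sketch: hold a reference so the driver-side RDD object cannot be GC'ed,
// which is what allows ContextCleaner to clean (unpersist) it asynchronously.
val foo = sc.parallelize(1 to 1000000, 100)
  .map(_ % 100 -> 1)
  .reduceByKey(_ + _, 100)
  .setName("foo")
  .cache()
foo.count()
// As long as `foo` stays reachable, "Cleaned RDD ..." should not fire for it.
{code}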

> Viewing the web UI for the first time unpersists a cached RDD
> -
>
> Key: SPARK-11161
> URL: https://issues.apache.org/jira/browse/SPARK-11161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> This one is a real head-scratcher. [Here's a 
> screencast|http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif]:
> !http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif!
> The three windows, left-to-right, are: 
> * a {{spark-shell}} on YARN with dynamic allocation enabled, at rest with one 
> executor. [Here's an example app's 
> environment|https://gist.github.com/ryan-williams/6dd3502d5d0de2f030ac].
> * [Spree|https://github.com/hammerlab/spree], opened to the above app's 
> "Storage" tab.
> * my YARN resource manager, showing a link to the web UI running on the 
> driver.
> At the start, nothing has been run in the shell, and I've not visited the web 
> UI.
> I run a simple job in the shell and cache a small RDD that it computes:
> {code}
> sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 
> 100).setName("foo").cache.count
> {code}
> As the second stage runs, you can see the partitions show up as cached in 
> Spree.
> After the job finishes, a few requested executors continue to fill in, which 
> you can see in the console at left or the nav bar of Spree in the middle.
> Once that has finished, everything is at rest with the RDD "foo" 100% cached.
> Then, I click the YARN RM's "ApplicationMaster" link which loads the web UI 
> on the driver for the first time.
> Immediately, the console prints some activity, including that RDD 2 has been 
> removed:
> {code}
> 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 
> on 172.29.46.15:33156 in memory (size: 1517.0 B, free: 7.2 GB)
> 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 
> on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1517.0 B, 
> free: 12.2 GB)
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 2
> 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 
> on 172.29.46.15:33156 in memory (size: 1666.0 B, free: 7.2 GB)
> 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 
> on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1666.0 B, 
> free: 12.2 GB)
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 1
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned shuffle 0
> 15/10/16 21:43:13 INFO storage.BlockManager: Removing RDD 2
> 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned RDD 2
> {code}
> Accordingly, Spree shows that the RDD has been unpersisted, and I can see in 
> the event log (not pictured in the screencast) that an Unpersist event has 
> made its way through the various SparkListeners:
> {code}
> {"Event":"SparkListenerUnpersistRDD","RDD ID":2}
> {code}
> Simply loading the web UI causes an RDD unpersist event to fire!
> I can't nail down exactly what's causing this, and I've seen evidence that 
> there are other sequences of events that can also cause it:
> * I've repro'd the above steps ~20 times. The RDD always gets unpersisted 
> when I've not visited the web UI until the RDD is cached, and when the app is 
> dynamically allocating executors.
> * One time, I observed the unpersist to fire without my even visiting the web 
> UI at all. Other times I wait a long time before visiting the web UI, so that 
> it is clear that the loading of the web UI is causal, and it always is, but 
> apparently there's another way for the unpersist to happen, seemingly rarely, 
> without visiting the web UI.
> * I tried a couple of times without dynamic allocation and could not 
> reproduce it.
> * I've tried a couple of times with dynamic allocation and starting with a 
> higher minimum number of executors than 1 and have been unable to reproduce 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line

2015-10-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962355#comment-14962355
 ] 

Sean Owen commented on SPARK-11162:
---

Isn't it already possible? That's what the thread shows.

> Allow enabling debug logging from the command line
> --
>
> Key: SPARK-11162
> URL: https://issues.apache.org/jira/browse/SPARK-11162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> Per [~vanzin] on [the user 
> list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html],
>  it would be nice if debug-logging could be enabled from the command line.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10994) Local clustering coefficient computation in GraphX

2015-10-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962359#comment-14962359
 ] 

Sean Owen commented on SPARK-10994:
---

My honest guess is that this will not be merged. Please have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and the 
section on new algorithms. I'm not sure this needs to be in Spark vs just 
hosted as an external package.

> Local clustering coefficient computation in GraphX
> --
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the local clustering 
> coefficient in GraphX. The local clustering coefficient of a vertex (node) in 
> a graph quantifies how close its neighbors are to being a clique (complete 
> graph). More specifically, the local clustering coefficient C_i for a vertex 
> v_i is given by the proportion of links between the vertices within its 
> neighbourhood divided by the number of links that could possibly exist 
> between them. Duncan J. Watts and Steven Strogatz introduced the measure in 
> 1998 to determine whether a graph is a small-world network. 
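
For readers unfamiliar with the metric, here is a rough sketch of how it can be
derived from existing GraphX primitives (triangle counts and degrees) on an
undirected graph. It assumes a spark-shell SparkContext sc and an illustrative
edge-list path; it is an illustration of the definition, not the proposed
implementation.

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Local clustering coefficient C_i = 2 * T_i / (k_i * (k_i - 1)),
// where T_i is the number of triangles through vertex i and k_i its degree.
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///tmp/edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triangles = graph.triangleCount().vertices          // VertexRDD[Int]
val coefficients = triangles.join(graph.degrees).mapValues {
  case (t, d) => if (d < 2) 0.0 else 2.0 * t / (d * (d - 1))
}
coefficients.take(10).foreach(println)
{code}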



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager.executorLost

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11163:
--
Target Version/s: 1.5.2, 1.6.0  (was: 1.6.0)
   Fix Version/s: (was: 1.5.2)
  (was: 1.5.1)

> Remove unnecessary addPendingTask calls in TaskSetManager.executorLost
> --
>
> Key: SPARK-11163
> URL: https://issues.apache.org/jira/browse/SPARK-11163
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> The proposed commit removes unnecessary calls to addPendingTask in
> TaskSetManager.executorLost. These calls are unnecessary: for
> tasks that are still pending and haven't been launched, they're
> still in all of the correct pending lists, so calling addPendingTask
> has no effect. For tasks that are currently running (which may still be
> in the pending lists, depending on how they were scheduled), we call
> addPendingTask in handleFailedTask, so the calls at the beginning
> of executorLost are redundant.
> I think these calls are left over from when we re-computed the locality
> levels in addPendingTask; now that we call recomputeLocality separately,
> I don't think these are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10893) Lag Analytic function broken

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10893.
---
Resolution: Duplicate

> Lag Analytic function broken
> 
>
> Key: SPARK-10893
> URL: https://issues.apache.org/jira/browse/SPARK-10893
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
> Environment: Spark Standalone Cluster on Linux
>Reporter: Jo Desmet
>
> Trying to aggregate with the LAG analytic function gives the wrong result. In 
> my test case it always returned the fixed value '103079215105' when run on an 
> integer column.
> Note that this only happens on Spark 1.5.0, and only when running in cluster 
> mode.
> It works fine when running on Spark 1.4.1, or when running in local mode. 
> I did not test on a yarn cluster.
> I did not test other analytic aggregates.
> Input JSON:
> {code:borderStyle=solid|title=/home/app/input.json}
> {"VAA":"A", "VBB":1}
> {"VAA":"B", "VBB":-1}
> {"VAA":"C", "VBB":2}
> {"VAA":"d", "VBB":3}
> {"VAA":null, "VBB":null}
> {code}
> Java:
> {code:borderStyle=solid}
> // Requires: import static org.apache.spark.sql.functions.lag;
> //           import org.apache.spark.sql.expressions.Window;
> SparkContext sc = new SparkContext(conf);
> HiveContext sqlContext = new HiveContext(sc);
> DataFrame df = sqlContext.read().json("file:///home/app/input.json");
> 
> df = df.withColumn(
>   "previous",
>   lag(df.col("VBB"), 1)
>     .over(Window.orderBy(df.col("VAA")))
> );
> {code}
> It is important to understand the conditions under which the job ran: I 
> submitted it to a standalone Spark cluster in client mode as follows:
> {code:borderStyle=solid}
> spark-submit \
>   --master spark://xx:7077 \
>   --deploy-mode client \
>   --class package.to.DriverClass \
>   --driver-java-options -Dhdp.version=2.2.0.0-2041 \
>   --num-executors 2 \
>   --driver-memory 2g \
>   --executor-memory 2g \
>   --executor-cores 2 \
>   /path/to/sample-program.jar
> {code}
> Expected Result:
> {code:borderStyle=solid}
> {"VAA":null, "VBB":null, "previous":null}
> {"VAA":"A", "VBB":1, "previous":null}
> {"VAA":"B", "VBB":-1, "previous":1}
> {"VAA":"C", "VBB":2, "previous":-1}
> {"VAA":"d", "VBB":3, "previous":2}
> {code}
> Actual Result:
> {code:borderStyle=solid}
> {"VAA":null, "VBB":null, "previous":103079215105}
> {"VAA":"A", "VBB":1, "previous":103079215105}
> {"VAA":"B", "VBB":-1, "previous":103079215105}
> {"VAA":"C", "VBB":2, "previous":103079215105}
> {"VAA":"d", "VBB":3, "previous":103079215105}
> {code}
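
For anyone trying to reproduce this from spark-shell, here is a hedged Scala
equivalent of the reporter's Java snippet (assuming a HiveContext-backed
sqlContext, which window functions required at the time, and the same input
path used above).

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Sketch of the same query in Scala; the file path follows the reporter's example.
val df = sqlContext.read.json("file:///home/app/input.json")
val withPrevious = df.withColumn(
  "previous",
  lag(df("VBB"), 1).over(Window.orderBy(df("VAA")))
)
withPrevious.show()
{code}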



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11142) org.datanucleus is already registered

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11142:
---

> org.datanucleus is already registered
> -
>
> Key: SPARK-11142
> URL: https://issues.apache.org/jira/browse/SPARK-11142
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Windows7 Home Basic
>Reporter: raicho
>Priority: Minor
>
> I first set up Spark this Wednesday on my computer. When I executed 
> spark-shell.cmd, warnings showed on the screen like "org.datanucleus is already 
> registered. Ensure you don't have multiple JAR versions of the same plugin in 
> the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is 
> already registered and you are trying to register an identical plugin located 
> at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar"  " and 
> "org.datanucleus.api.jdo is already registered. Ensure you don't have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you 
> are trying to register an identical plugin located at URL 
> "file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" "
> The two URLs shown in fact refer to the same path. I tried to find the 
> classpath in the configuration files but failed. No other code has been 
> executed on Spark yet.
> What happened, and how should I deal with this warning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11142) org.datanucleus is already registered

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11142.
---
  Resolution: Fixed
   Fix Version/s: (was: 1.5.1)
Target Version/s:   (was: 1.5.1)

[~raicho] please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA. There are a number of problems with this one, like that you've 
set Fix/Target version. Mostly, however, questions are best to start on 
u...@spark.apache.org .

> org.datanucleus is already registered
> -
>
> Key: SPARK-11142
> URL: https://issues.apache.org/jira/browse/SPARK-11142
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Windows7 Home Basic
>Reporter: raicho
>Priority: Minor
>
> I first set up Spark this Wednesday on my computer. When I executed 
> spark-shell.cmd, warnings showed on the screen like "org.datanucleus is already 
> registered. Ensure you don't have multiple JAR versions of the same plugin in 
> the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is 
> already registered and you are trying to register an identical plugin located 
> at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar"  " and 
> "org.datanucleus.api.jdo is already registered. Ensure you don't have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you 
> are trying to register an identical plugin located at URL 
> "file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" "
> The two URLs shown in fact refer to the same path. I tried to find the 
> classpath in the configuration files but failed. No other code has been 
> executed on Spark yet.
> What happened, and how should I deal with this warning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11142) org.datanucleus is already registered

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11142.
---
Resolution: Invalid

> org.datanucleus is already registered
> -
>
> Key: SPARK-11142
> URL: https://issues.apache.org/jira/browse/SPARK-11142
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 1.5.1
> Environment: Windows7 Home Basic
>Reporter: raicho
>Priority: Minor
>
> I first set up Spark this Wednesday on my computer. When I executed 
> spark-shell.cmd, warnings showed on the screen like "org.datanucleus is already 
> registered. Ensure you don't have multiple JAR versions of the same plugin in 
> the classpath. The URL "file:/c:/spark/lib/datanucleus-core-3.2.10.jar" is 
> already registered and you are trying to register an identical plugin located 
> at URL "file:/c:/spark/bin/../lib/datanucleus-core-3.2.10.jar"  " and 
> "org.datanucleus.api.jdo is already registered. Ensure you don't have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/c:/spark/lib/datanucleus-core-3.2.6.jar" is already registered and you 
> are trying to register an identical plugin located at URL 
> "file:/c:/spark/bin/../lib/datanucleus-core-3.2.6.jar" "
> The two URLs shown in fact refer to the same path. I tried to find the 
> classpath in the configuration files but failed. No other code has been 
> executed on Spark yet.
> What happened, and how should I deal with this warning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6478) new RDD.pipeWithPartition method

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6478.
--
Resolution: Won't Fix

> new RDD.pipeWithPartition method
> 
>
> Key: SPARK-6478
> URL: https://issues.apache.org/jira/browse/SPARK-6478
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Maxim Ivanov
>Priority: Minor
>  Labels: pipe
>
> This method would allow building the command-line arguments and the process
> environment map using the partition as an argument.
> The use case for this feature is to provide additional information about the 
> partition to the spawned application in cases where the partitioner provides it 
> (like in the Cassandra connector, or when a custom partitioner/RDD is used).
> It would also provide a simpler and more intuitive alternative to the 
> printPipeContext function.
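
For comparison, here is the existing RDD.pipe API alongside a purely hypothetical
sketch of the kind of signature this issue asks for. pipeWithPartition never
landed, so the name and parameters in the commented-out definition are
illustrative only; the HDFS path is also just an example.

{code}
// Today: the command and environment are fixed for the whole RDD.
val counts = sc.textFile("hdfs:///logs/*.txt").pipe("wc -l")

// Hypothetical sketch only -- this method does not exist in Spark. It shows
// the shape of API proposed here, where the command and environment can be
// derived from the partition being processed:
// def pipeWithPartition(
//     command: Partition => Seq[String],
//     env: Partition => Map[String, String]): RDD[String] = ???
{code}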



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6593.
--
Resolution: Won't Fix

> Provide option for HadoopRDD to skip corrupted files
> 
>
> Key: SPARK-6593
> URL: https://issues.apache.org/jira/browse/SPARK-6593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Dale Richardson
>Priority: Minor
>
> When reading a large number of gzip files from HDFS, e.g. with 
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
> report an exception then the entire job is canceled. As default behaviour 
> this is probably for the best, but in circumstances where you know it is safe, 
> it would be nice to have the option to skip the corrupted file 
> and continue the job. 
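
In the meantime, one hedged workaround (a sketch, not a supported option) is to
read the gzip files as binary streams and swallow decode failures per file, at
the cost of bypassing textFile. The path below mirrors the example above and is
only illustrative.

{code}
import java.util.zip.GZIPInputStream
import scala.io.Source
import scala.util.Try

// Sketch: decode each .gz file manually so that a corrupt file contributes
// nothing instead of failing the whole job. Each file's lines are materialized
// in memory, so this only suits modestly sized files.
val lines = sc.binaryFiles("hdfs:///user/cloudera/logs*.gz").flatMap {
  case (path, stream) =>
    Try(Source.fromInputStream(new GZIPInputStream(stream.open()))
          .getLines().toList)
      .getOrElse(Seq.empty[String])
}
lines.count()
{code}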



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6639) Create a new script to start multiple masters

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6639.
--
Resolution: Won't Fix

> Create a new script to start multiple masters
> -
>
> Key: SPARK-6639
> URL: https://issues.apache.org/jira/browse/SPARK-6639
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 1.3.0
> Environment: all
>Reporter: Tao Wang
>Priority: Minor
>  Labels: patch
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The start-slaves.sh script is able to read from the slaves file and start slave 
> nodes on multiple boxes.
> However, in standalone mode, if I want to use multiple masters I'll have to 
> start a master on each individual box, and I also need to provide the list of 
> masters' hostname+port to each worker (start-slaves.sh only takes one master 
> ip+port for now).
> Should we create a new script called start-masters.sh that reads the 
> conf/masters file? The start-slaves.sh script may also need to change a little 
> so that the master list can be passed to the worker nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11169) Remove the extra spaces in merge script

2015-10-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11169.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9156
[https://github.com/apache/spark/pull/9156]

> Remove the extra spaces in merge script
> ---
>
> Key: SPARK-11169
> URL: https://issues.apache.org/jira/browse/SPARK-11169
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Our merge script now turns 
> {code}
> [SPARK-1234][SPARK-1235][SPARK-1236][SQL] description
> {code}
> into 
> {code}
> [SPARK-1234] [SPARK-1235] [SPARK-1236] [SQL] description
> {code}
> The extra spaces are more annoying in git since the first line of a git 
> commit is supposed to be very short.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10582) using dynamic-executor-allocation, if AM failed. the new AM will be started. But the new AM does not allocate executors to driver

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10582.
---
Resolution: Won't Fix

> using dynamic-executor-allocation, if AM failed. the new AM will be started. 
> But the new AM does not allocate executors to driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: KaiXinXIaoLei
>
> While tasks are running, suppose the total number of executors equals 
> spark.dynamicAllocation.maxExecutors and the AM fails; a new AM then restarts. 
> Because the total number of executors tracked in ExecutorAllocationManager has 
> not changed, the driver does not send RequestExecutors to the AM to ask for 
> executors. The new AM therefore falls back to 
> spark.dynamicAllocation.initialExecutors, so the driver and the AM disagree on 
> the total number of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7300) The temporarily generated directories are not cleaned while calling the insertInto API in DataFrame

2015-10-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7300.
--
Resolution: Won't Fix

> The temporarily generated directories are not cleaned while calling the 
> insertInto API in DataFrame
> --
>
> Key: SPARK-7300
> URL: https://issues.apache.org/jira/browse/SPARK-7300
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: pengxu
>Priority: Minor
>
> While using the insertInto function provided by DataFrame, some temporary 
> directories are generated. Those temporary directories should be removed 
> after all the files under them have been inserted into the Hive tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org