[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535703#comment-14535703
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

OK, no problem for me.

> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
> Issue Type: Wish
> Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
> Reporter: Rafa Haro
> Assignee: Karl Wright
> Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming big data processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used to 
> stream or dispatch crawled documents or metadata into a big data processing 
> pipeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CONNECTORS-1197) FileSystem output connector error with some file names

2015-05-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534333#comment-14534333
 ] 

Karl Wright commented on CONNECTORS-1197:
-

second fix:
r1678329 (trunk)
r1678330 (dev_1x)


> FileSystem output connector error with some file names
> --
>
> Key: CONNECTORS-1197
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1197
> Project: ManifoldCF
> Issue Type: Improvement
> Components: File system connector
> Affects Versions: ManifoldCF 2.1
> Environment: Windows 7 64 bit
> Reporter: Andrea
> Assignee: Karl Wright
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> I'm having some problems running a job that starts from a web crawl and 
> writes through a file system output connector. 
> The job terminates with an error like the following (I think it may be 
> caused by special characters in the file name).
> Error: Could not create file 
> 'E:\ManifoldCF\http\nypost.com\2015\05\06\bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email':
>  
> E:\ManifoldCF\http\nypost.com\2015\05\06\bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email
>  (The filename, directory name, or volume label syntax is incorrect)
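
The `?` in the path above is one of the characters that Windows does not permit in file or directory names. The following is a minimal Java sketch of a check for such characters in a single path segment. The class and method names are hypothetical and are not taken from ManifoldCF; the sketch only illustrates why this particular name is rejected.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowsNameCheck {
    // Characters that Windows reserves in file and directory names.
    private static final String RESERVED = "<>:\"/\\|?*";

    /** Returns the reserved or control characters found in one path segment. */
    public static List<Character> reservedCharsIn(String segment) {
        List<Character> found = new ArrayList<>();
        for (char c : segment.toCharArray()) {
            // Control characters (code points below 32) are also rejected by Windows.
            if (c < 32 || RESERVED.indexOf(c) >= 0) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String name =
            "bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email";
        System.out.println(reservedCharsIn(name)); // prints [?]
    }
}
```

The check looks only at characters. It does not cover other Windows naming rules such as reserved device names or trailing dots and spaces.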





[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534279#comment-14534279
 ] 

Karl Wright commented on CONNECTORS-1162:
-

Consider the issue "assigned".

Unfortunately, I cannot *actually* assign the ticket without putting you on the 
Jira list of ManifoldCF committers.  So I'll hold the actual assignment, if 
that's OK with you.



[jira] [Assigned] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1162:
---

Assignee: Karl Wright  (was: Rafa Haro)



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534115#comment-14534115
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

Hi Karl,

I have been quite busy for the last couple of weeks because of school projects. 
I have not had a chance to look at a book yet. I will start this weekend.

> Apache Kafka Output Connector
> -
>
> Key: CONNECTORS-1162
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1162
> Project: ManifoldCF
> Issue Type: Wish
> Affects Versions: ManifoldCF 1.8.1, ManifoldCF 2.0.1
> Reporter: Rafa Haro
> Assignee: Rafa Haro
> Labels: gsoc, gsoc2015
> Fix For: ManifoldCF 1.10, ManifoldCF 2.2
>
>
> Kafka is a distributed, partitioned, replicated commit log service. It 
> provides the functionality of a messaging system, but with a unique design. A 
> single Kafka broker can handle hundreds of megabytes of reads and writes per 
> second from thousands of clients.
> Apache Kafka is being used for a number of use cases. One of them is to use 
> Kafka as a feeding system for streaming big data processes, in both Apache 
> Spark and Hadoop environments. A Kafka output connector could be used to 
> stream or dispatch crawled documents or metadata into a big data processing 
> pipeline.





[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Tugba Dogan (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534118#comment-14534118
 ] 

Tugba Dogan commented on CONNECTORS-1162:
-

By the way, can you assign this issue to me?



[jira] [Commented] (CONNECTORS-1162) Apache Kafka Output Connector

2015-05-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534100#comment-14534100
 ] 

Karl Wright commented on CONNECTORS-1162:
-

Hi Tugba,

Haven't heard from you in a while.  How are things going?




[jira] [Resolved] (CONNECTORS-1197) FileSystem output connector error with some file names

2015-05-08 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1197.
-
Resolution: Fixed



[jira] [Updated] (CONNECTORS-1197) FileSystem output connector error with some file names

2015-05-08 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1197:

Fix Version/s: ManifoldCF 2.2
   ManifoldCF 1.10



[jira] [Commented] (CONNECTORS-1197) FileSystem output connector error with some file names

2015-05-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534086#comment-14534086
 ] 

Karl Wright commented on CONNECTORS-1197:
-

r1678300 (trunk)
r1678301 (dev_1x)




[jira] [Commented] (CONNECTORS-1197) FileSystem output connector error with some file names

2015-05-08 Thread Andrea (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534022#comment-14534022
 ] 

Andrea commented on CONNECTORS-1197:


From my point of view the first solution would be better. The second one is 
valid, but then some documents would be missing from the output without a 
specific "crawling reason". Does that make sense to you?

> FileSystem output connector error with some file names
> --
>
> Key: CONNECTORS-1197
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1197
> Project: ManifoldCF
> Issue Type: Improvement
> Components: File system connector
> Affects Versions: ManifoldCF 2.1
> Environment: Windows 7 64 bit
> Reporter: Andrea
> Assignee: Karl Wright
>
> I'm having some problems running a job that starts from a web crawl and 
> writes through a file system output connector. 
> The job terminates with an error like the following (I think it may be 
> caused by special characters in the file name).
> Error: Could not create file 
> 'E:\ManifoldCF\http\nypost.com\2015\05\06\bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email':
>  
> E:\ManifoldCF\http\nypost.com\2015\05\06\bloombergs-the-man-to-beat-hillary-for-democratic-nomination?msg=fail&shared=email
>  (The filename, directory name, or volume label syntax is incorrect)





[jira] [Commented] (CONNECTORS-1197) FileSystem output connector error with some file names

2015-05-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534014#comment-14534014
 ] 

Karl Wright commented on CONNECTORS-1197:
-

Hi Andrea,

It is not possible to just detect a failure and then modify the document name 
when it is detected, for several reasons.  One is that we don't get good 
feedback from Java as to what exactly is wrong with the filename.  Another 
is that the connector also has to handle document deletion, which has an 
entirely different error structure.

Your only choices are therefore the following:
(1) A special "windows" mode, which does an entirely different character 
mapping and where no attempt is made to be wget compliant at all;
(2) Skipping any files whose names cause hard errors on write.

Thanks.
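
As an illustration of what the character mapping in option (1) might look like, here is a minimal Java sketch. It is not the mapping committed for this issue: the class name, the method name, and the choice of percent-style escaping are all assumptions made for the example.

```java
public class WindowsNameMapper {
    // Characters that Windows reserves in file and directory names.
    private static final String RESERVED = "<>:\"/\\|?*";

    /**
     * Hypothetical mapping: replaces each reserved or control character in a
     * path segment with %XX (two uppercase hex digits). '%' itself is escaped
     * too, so that two different inputs cannot map to the same output.
     */
    public static String mapSegment(String segment) {
        StringBuilder sb = new StringBuilder();
        for (char c : segment.toCharArray()) {
            if (c == '%' || c < 32 || RESERVED.indexOf(c) >= 0) {
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapSegment("nomination?msg=fail&shared=email"));
    }
}
```

Because the mapping depends only on the name, the same function can be applied on both the write path and the delete path, so nothing has to be inferred from a failed write. The sketch does not handle reserved device names or trailing dots and spaces, which a real Windows mode would also need to consider.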
